Advances in Electron Microscopy with Deep Learning
Jeffrey M. Ede
Date: 2021-01-04. DOI: 10.5281/zenodo.4429792
This doctoral thesis covers some of my advances in electron microscopy with deep learning. Highlights include a comprehensive review of deep learning in electron microscopy; large new electron microscopy datasets for machine learning, dataset search engines based on variational autoencoders, and automatic data clustering by t-distributed stochastic neighbour embedding; adaptive learning rate clipping to stabilize learning; generative adversarial networks for compressed sensing with spiral, uniformly spaced and other fixed sparse scan paths; recurrent neural networks trained to piecewise adapt sparse scan paths to specimens by reinforcement learning; improving signal-to-noise; and conditional generative adversarial networks for exit wavefunction reconstruction from single transmission electron micrographs. This thesis adds to my publications by presenting their relationships, reflections, and holistic conclusions. This copy of my thesis is typeset for online dissemination to improve readability, whereas the thesis submitted to the University of Warwick in support of my application for the degree of Doctor of Philosophy in Physics will be typeset for physical printing and binding.
6. Numbers of results per year returned by Dimensions.ai abstract searches for SEM, TEM, STEM, STM and REM qualitate their popularities. The number of results for 2020 is extrapolated using the mean rate before 14th July 2020.
7. Visual comparison of various normalization methods highlighting regions that they normalize. Regions can be normalized across batch, feature and other dimensions, such as height and width.
8. Visualization of convolutional layers. a) Traditional convolutional layer where output channels are sums of biases and convolutions of weights with input channels. b) Depthwise separable convolutional layer where depthwise convolutions compute one convolution with weights for each input channel. Output channels are sums of biases and pointwise convolutions of weights with depthwise channels.
9. Two 96×96 electron micrographs a) unchanged, and filtered by b) a 5×5 symmetric Gaussian kernel with a 2.5 px standard deviation, c) a 3×3 horizontal Sobel kernel, and d) a 3×3 vertical Sobel kernel. Intensities in a) and b) are in [0, 1], whereas intensities in c) and d) are in [-1, 1].
S3. Two-dimensional tSNE visualization of the first 50 principal components of 17266 TEM images that have been downsampled to 96×96. The same grid is used to show a) map points and b) images at 500 randomly selected points.
S4. Two-dimensional tSNE visualization of the first 50 principal components of 36324 exit wavefunctions that have been downsampled to 96×96. Wavefunctions were simulated for thousands of materials and a large range of physical hyperparameters. The same grid is used to show a) map points and b) wavefunctions at 500 randomly selected points. Red and blue colour channels show real and imaginary components, respectively.
S5. Two-dimensional tSNE visualization of the first 50 principal components of 11870 exit wavefunctions that have been downsampled to 96×96. Wavefunctions were simulated for thousands of materials and a small range of physical hyperparameters. The same grid is used to show a) map points and b) wavefunctions at 500 randomly selected points.
Red and blue colour channels show real and imaginary components, respectively. S6. Two-dimensional tSNE visualization of the first 50 principal components of 4825 exit wavefunctions that have been downsampled to 96×96. Wavefunctions were simulated for thousands of materials and a small range of physical hyperparameters. The same grid is used to show a) map points and b) wavefunctions at 500 randomly selected points. Red and blue colour channels show real and imaginary components, respectively. S7. Two-dimensional tSNE visualization of means parameterized by 64-dimensional VAE latent spaces for 19769 STEM images that have been downsampled to 96×96. The same grid is used to show a) map points and b) images at 500 randomly selected points. S8. Two-dimensional tSNE visualization of means parameterized by 64-dimensional VAE latent spaces for 19769 96×96 crops from STEM images. The same grid is used to show a) map points and b) images at 500 randomly selected points. S9. Two-dimensional tSNE visualization of means parameterized by 64-dimensional VAE latent spaces for 19769 TEM images that have been downsampled to 96×96. The same grid is used to show a) map points and b) images at 500 randomly selected points. S10. Two-dimensional tSNE visualization of means and standard deviations parameterized by 64-dimensional VAE latent spaces for 19769 96×96 crops from STEM images. The same grid is used to show a) map points and b) images at 500 randomly selected points. S11. Two-dimensional uniformly separated tSNE visualization of 64-dimensional VAE latent spaces for 19769 96×96 crops from STEM images. S12. Two-dimensional uniformly separated tSNE visualization of 64-dimensional VAE latent spaces for 19769 STEM images that have been downsampled to 96×96. S13. Two-dimensional uniformly separated tSNE visualization of 64-dimensional VAE latent spaces for 17266 TEM images that have been downsampled to 96×96. S14. Examples of top-5 search results for 96×96 TEM images. Euclidean distances between µ encoded for search inputs and results are smaller for more similar images. S15. Examples of top-5 search results for 96×96 STEM images. Euclidean distances between µ encoded for search inputs and results are smaller for more similar images. Chapter 3 Adaptive Learning Rate Clipping Stabilizes Learning x 1. Unclipped learning curves for 2× CIFAR-10 supersampling with batch sizes 1, 4, 16 and 64 with and without adaptive learning rate clipping of losses to 3 standard deviations above their running means. Training is more stable for squared errors than quartic errors. Learning curves are 500 iteration boxcar averaged. 2. Unclipped learning curves for 2× CIFAR-10 supersampling with ADAM and SGD optimizers at stable and unstably high learning rates, η. Adaptive learning rate clipping prevents loss spikes and decreases errors at unstably high learning rates. Learning curves are 500 iteration boxcar averaged. 3. Neural network completions of 512×512 scanning transmission electron microscopy images from 1/20 coverage blurred spiral scans. 4 . Outer generator losses show that ALRC and Huberization stabilize learning. ALRC lowers final mean squared error (MSE) and Huberized MSE losses and accelerates convergence. Learning curves are 2500 iteration boxcar averaged. 5 . Convolutional image 2× supersampling network with three skip-2 residual blocks. 6. Two-stage generator that completes 512×512 micrographs from partial scans. A dashed line indicates that the same image is input to the inner and outer generator. 
Large scale features developed by the inner generator are locally enhanced by the outer generator and turned into images. An auxiliary inner generator trainer restores images from inner generator features to provide direct feedback. 1. Examples of Archimedes spiral (top) and jittered gridlike (bottom) 512×512 partial scan paths for 1/10, 1/20, 1/40, and 1/100 px coverage. 2. Simplified multiscale generative adversarial network. An inner generator produces large-scale features from inputs. These are mapped to half-size completions by a trainer network and recombined with the input to generate full-size completions by an outer generator. Multiple discriminators assess multiscale crops from input images and full-size completions. This figure was created with Inkscape. 3. Adversarial and non-adversarial completions for 512×512 test set 1/20 px coverage blurred spiral scan inputs. Adversarial completions have realistic noise characteristics and structure whereas non-adversarial completions are blurry. The bottom row shows a failure case where detail is too fine for the generator to resolve. Enlarged 64×64 regions from the top left of each image are inset to ease comparison, and the bottom two rows show non-adversarial generators outputting more detailed features nearer scan paths. 4. Non-adversarial generator outputs for 512×512 1/20 px coverage blurred spiral and gridlike scan inputs. Images with predictable patterns or structure are accurately completed. Circles accentuate that generators cannot reliably complete unpredictable images where there is no information. This figure was created with Inkscape. 5. Generator mean squared errors (MSEs) at each output pixel for 20000 512×512 1/20 px coverage test set images. Systematic errors are lower near spiral paths for variants of MSE training, and are less structured for adversarial training. Means, µ, and standard deviations, σ, of all pixels in each image are much higher for adversarial outputs. Enlarged 64×64 regions from the top left of each image are inset to ease comparison, and to show that systematic errors for MSE training are higher near output edges. 6 . Test set root mean squared (RMS) intensity errors for spiral scans in [0, 1] xi Chapter 4 Supplementary Information: Partial Scanning Transmission Electron Microscopy with Deep Learning S1. Discriminators examine random w×w crops to predict whether complete scans are real or generated. Generators are trained by multiple discriminators with different w. This figure was created with Inkscape. S2. Two-stage generator that completes 512×512 micrographs from partial scans. A dashed line indicates that the same image is input to the inner and outer generator. Large scale features developed by the inner generator are locally enhanced by the outer generator and turned into images. An auxiliary trainer network restores images from inner generator features to provide direct feedback. This figure was created with Inkscape. S3. Learning curves. a) Training with an auxiliary inner generator trainer stabilizes training, and converges to lower than two-stage training with fine tuning. b) Concatenating beam path information to inputs decreases losses. Adding symmetric residual connections between strided inner generator convolutions and transpositional convolutions increases losses. c) Increasing sizes of the first inner and outer generator convolutional kernels does not decrease losses. d) Losses are lower after more interations, and a learning rate (LR) of 0.0004; rather than 0.0002. 
Labels indicate inner generator iterations -outer generator iterations -fine tuning iterations, and k denotes multiplication by 1000 e) Adaptive learning rate clipped quartic validation losses have not diverged from training losses after 10 6 iterations. f) Losses are lower for outputs in [0, 1] than for outputs in [-1, 1] if leaky ReLU activation is applied to generator outputs. S4. Learning curves. a) Making all convolutional kernels 3×3, and not applying leaky ReLU activation to generator outputs does not increase losses. b) Nearest neighbour infilling decreases losses. Noise was not added to low duration path segments for this experiment. c) Losses are similar whether or not extra noise is added to low-duration path segments. d) Learning is more stable and converges to lower errors at lower learning rates (LRs). Losses are lower for spirals than grid-like paths, and lowest when no noise is added to low-intensity path segments. e) Adaptive momentum-based optimizers, ADAM and RMSProp, outperform non-adaptive momentum optimizers, including Nesterov-accelerated momentum. ADAM outperforms RMSProp; however, training hyperparameters and learning protocols were tuned for ADAM. Momentum values were 0.9. f) Increasing partial scan pixel coverages listed in the legend decreases losses. S5. Adaptive learning rate clipping stabilizes learning, accelerates convergence and results in lower errors than Huberisation. Weighting pixel errors with their running or final mean errors is ineffective. S6. Non-adversarial 512×512 outputs and blurred true images for 1/17.9 px coverage spiral scans selected with binary masks. S7. Non-adversarial 512×512 outputs and blurred true images for 1/27.3 px coverage spiral scans selected with binary masks. S8. Non-adversarial 512×512 outputs and blurred true images for 1/38.2 px coverage spiral scans selected with binary masks. S9. Non-adversarial 512×512 outputs and blurred true images for 1/50.0 px coverage spiral scans selected with binary masks. S10. Non-adversarial 512×512 outputs and blurred true images for 1/60.5 px coverage spiral scans selected with binary masks. S11. Non-adversarial 512×512 outputs and blurred true images for 1/73.7 px coverage spiral scans selected with binary masks. xii S12. Non-adversarial 512×512 outputs and blurred true images for 1/87.0 px coverage spiral scans selected with binary masks. Chapter 5 Adaptive Partial Scanning Transmission Electron Microscopy with Reinforcement Learning 1. Example 8×8 partial scan with T = 5 straight path segments. Each segment in this example has 3 probing positions separated by d = 2 1/2 px and their starts are labelled by step numbers, t. Partial scans are selected from STEM images by sampling pixels nearest probing positions, even if the probing position is nominally outside an imaging region. 2. Microjob service platforms. The size of typical tasks varies for different platforms and some platforms specialize in preparing machine learning datasets. Chapter 2 Warwick Electron Microscopy Datasets 1. Examples and descriptions of STEM images in our datasets. References put some images into context to make them more tangible to unfamiliar readers. 2. Examples and descriptions of TEM images in our datasets. References put some images into context to make them more tangible to unfamiliar readers. Chapter 2 Supplementary Information: Warwick Electron Microscopy Datasets S1. To ease comparison, we have tabulated figure numbers for tSNE visualizations. 
Visualizations are for principal components, VAE latent space means, and VAE latent space means weighted by standard deviations. Chapter 3 Adaptive Learning Rate Clipping Stabilizes Learning 1. Adaptive learning rate clipping (ALRC) for losses 2, 3, 4 and ∞ running standard deviations above their running means for batch sizes 1, 4, 16 and 64. ARLC was not applied for clipping at ∞. Each squared and quartic error mean and standard deviation is for the means of the final 5000 training errors of 10 experiments. ALRC lowers errors for unstable quartic error training at low batch sizes and otherwise has little effect. Means and standard deviations are multiplied by 100. 2. Means and standard deviations of 20000 unclipped test set MSEs for STEM supersampling networks trained with various learning rate clipping algorithms and clipping hyperparameters, n ↑ and n ↓ , above and below, respectively. Chapter 4 Partial Scanning Transmission Electron Microscopy with Deep Learning 1. Means and standard deviations of pixels in images created by takings means of 20000 512×512 test set squared difference images with intensities in [-1, 1] for methods to decrease systematic spatial error variation. Variances of Laplacians were calculated after linearly transforming mean images to unit variance. Chapter 6 Improving Electron Micrograph Signal-to-Noise with an Atrous Convolutional Encoder-Decoder xvii 1. Mean MSE and SSIM for several denoising methods applied to 20000 instances of Poisson noise and their standard errors. All methods were implemented with default parameters. Gaussian: 3×3 kernel with a 0.8 px standard deviation. Bilateral: 9×9 kernel with radiometric and spatial scales of 75 (scales below 10 have little effect while scales above 150 cartoonize images). Median: 3×3 kernel. Wiener: no parameters. Wavelet: BayesShrink adaptive wavelet soft-thresholding with wavelet detail coefficient thresholds estimated using . Chambolle xviii Acknowledgments Most modern research builds on a high variety of intellectual contributions, many of which are often overlooked as there are too many to list. Examples include search engines, programming languages, machine learning frameworks, programming libraries, software development tools, computational hardware, operating systems, computing forums, research archives, and scholarly papers. To help developers with limited familiarity, useful resources for deep learning in electron microscopy are discussed in a review paper covered by ch. 1 of my thesis. For brevity, these acknowledgments will focus on personal contributions to my development as a researcher. • Thanks go to Jeremy Sloan and Richard Beanland for supervision, internal peer review, and co-authorship. • Thanks go to my Feedback Supervisors, Emma MacPherson and Jon Duffy, for comments needed to partially fulfil requirements of Doctoral Skills Modules (DSMs). • I am grateful to Marin Alexe and Dong Jik Kim for supervising me during a summer project where I programmed various components of atomic force microscopes. It was when I first realized that I want to be a programmer. Before then, I only thought of programming as something that I did in my spare time. • I am grateful to James Lloyd-Hughes for supervising me during a summer project where I automated Fourier analysis of ultrafast optical spectroscopy signals. • I am grateful to my family for their love and support. 
As a special note, I first taught myself machine learning by working through Mathematica documentation, implementing every machine learning example that I could find. The practice made use of spare time during a two-week course at the start of my Doctor of Philosophy (PhD) studentship, which was needed to partially fulfil requirements of the Midlands Physics Alliance Graduate School (MPAGS). This thesis covers a subset of my scientific papers on advances in electron microscopy with deep learning. The papers were prepared while I was a PhD student at the University of Warwick in support of my application for the degree of PhD in Physics. This thesis reflects on my research, unifies covered publications, and discusses future research directions. My papers are available as part of chapters of this thesis, or from their original publication venues with hypertext and other enhancements. This preface covers my initial motivation to investigate deep learning in electron microscopy, structure and content of my thesis, and relationships between included publications. Traditionally, physics PhD theses submitted to the University of Warwick are formatted for physical printing and binding. However, I have also formatted a copy of my thesis for online dissemination to improve readability 37 . When I started my PhD in October 2017, we were unsure if or how machine learning could be applied to electron microscopy. My PhD was funded by EPSRC Studentship 1917382 38 titled "Application of Novel Computing and Data Analysis Methods in Electron Microscopy", which is associated with EPSRC grant EP/N035437/1 39 titled "ADEPT -Advanced Devices by ElectroPlaTing". As part of the grant, our initial plan was for me to spend a couple of days per week using electron microscopes to analyse specimens sent to the University of Warwick from the University of Southampton, and to invest remaining time developing new computational techniques to help with analysis. However, an additional scientist was not needed to analyse specimens, so it was difficult for me to get electron microscopy training. While waiting for training, I was tasked with automating analysis of digital large angle convergent beam electron diffraction 40 (D-LACBED) patterns. However, we did not have a compelling use case for my D-LACBED software 26, 41 . Further, a more senior PhD student at the University of Warwick, Alexander Hubert, was already investigating convergent beam electron diffraction 40, 42 (CBED). My first machine learning research began five months after I started my PhD. Without a clear research direction or specimens to study, I decided to develop artificial neural networks (ANNs) to generate artwork. My dubious plan was to create image processing pipelines for the artwork, which I would replace with electron micrographs when I got specimens to study. However, after investigating artwork generation with randomly initialized multilayer perceptrons 43, 44 , then by style transfer 45, 46 , and then by fast style transfer 47 , there were still no specimens for me to study. Subsequently, I was inspired by NVIDIA's research on semantic segmentation 48 to investigate semantic segmentation with DeepLabv3+ 49 . However, I decided that it was unrealistic for me to label a large new electron microscopy dataset for semantic segmentation by myself. 
Fortunately, I had read about using deep neural networks (DNNs) to reduce image compression artefacts 50 , so I wondered if a similar approach based on DeepLabv3+ could improve electron micrograph signal-to-noise. Encouragingly, it would not require time-consuming image labelling. Following a successful investigation into improving signal-to-noise, my first scientific paper 6 (ch. 6) was submitted a few months later, and my experience with deep learning enabled subsequent investigations. and raising awareness to reduce unnecessary duplication of research. Empirically, there are no significant textual differences between arXiv preprints and corresponding journal papers 78 . However, journal papers appear to be slightly higher quality than biomedical preprints 78, 79 , suggesting that formatting and copyediting practices vary between scientific disciplines. Overall, I think that a lack of differences between journal papers and preprints may be a result of publishers separating language editing into premium services [80] [81] [82] [83] , rather than including extensive language editing in their usual publication processes. Increasing textual quality is correlated with increasing likelihood that an article will be published 84 . However, most authors appear to be performing copyediting themselves to avoid extra fees. A secondary benefit of posting arXiv preprints is that their metadata, an article in portable document format 85, 86 (PDF) , and any Latex source files are openly accessible. This makes arXiv files easy to reuse, especially if they are published under permissive licenses 87 . For example, open accessibility enabled arXiv files to be curated into a large dataset 88 that was used to predict future research trends 89 . Further, although there is no requirement for preprints to peer reviewed, preprints can enable early access to papers that have been peer reviewed. As a case in point, all preprints covered by my thesis have been peer reviewed. Further, the arXiv implicitly supports peer review by providing contact details of authors, and I have both given and received feedback about arXiv papers. In addition, open peer review platforms 90 , such as OpenReview 91, 92 , can be used to explicitly seek peer review. There is also interest in integrating peer review with the arXiv, so a conceptual peer review model has been proposed 93 This thesis covers a selection of my interconnected scientific papers. Word counts for my papers and covering text are tabulated in table 1. Figures are included in word counts by adding products of nominal word densities and figure areas. However, acknowledgements, references, tables, supplementary information, and similar contents are not included as they do not count towards my thesis length limit of 70000 words. For details, notes on my word counting procedure are openly accessible 28 . Associated research outputs, such as source code and datasets, are not directly included in my thesis due to format restrictions. Nevertheless, my source code is openly accessible from GitHub 94 , and archived releases of my source code are openly accessible from Zenodo 95 . In addition, links to openly accessible pretrained models are provided in my source code documentation. Finally, links to openly accessible datasets are in my papers, source code documentation, and datasets paper 2 (ch. 2). often minimized by hardware. 
For example, by using aberration correctors 136, 157-159, choosing scanning transmission electron microscopy (STEM) scan shapes and speeds that minimize distortions 138, and using stable sample holders to reduce drift 160. Beam damage can also be reduced by using minimal electron voltage and electron dose 161-163, or dose-fractionation across multiple frames in multi-pass transmission electron microscopy 164-166 (TEM) or STEM 167. Deep learning is being applied to improve signal-to-noise for a variety of applications 168-176. Most approaches in electron microscopy involve training ANNs to map low-quality experimental 177, artificially deteriorated 70, 178 or synthetic 179-182 inputs to paired high-quality experimental measurements. For example, applications of a DNN trained with artificially deteriorated TEM images are shown in figure 1. However, ANNs have also been trained with unpaired datasets of low-quality and high-quality electron micrographs 183, or pairs of low-quality electron micrographs 184, 185. Another approach is Noise2Void 168, where ANNs are trained from single noisy images. However, Noise2Void removes information by masking noisy input pixels corresponding to target output pixels. So far, most ANNs that improve electron microscope signal-to-noise have been trained to decrease statistical noise 70, 177, 179-181, 181-184, 186 as other approaches have been developed to correct electron microscope scan distortions 187, 188 and specimen drift 141, 188, 189. However, we anticipate that ANNs will be developed to correct a variety of electron microscopy noise as ANNs have been developed for aberration correction of optical microscopy 190-195 and photoacoustic 196 signals. Compressed sensing 203-207 is the efficient reconstruction of a signal from a subset of measurements. Applications include faster medical imaging 208-210, image compression 211, 212, increasing image resolution 213, 214, lower medical radiation exposure 215-217, and low-light vision 218, 219. In STEM, compressed sensing has enabled electron beam exposure and scan time to be decreased by 10-100× with minimal information loss 201, 202. Thus, compressed sensing can be essential to investigations where the high current density of electron probes damages specimens 161, 220-226. Even if the effects of beam damage can be corrected by postprocessing, the damage to specimens is often permanent. Examples of beam-sensitive materials include organic crystals 227, metal-organic frameworks 228, nanotubes 229, and nanoparticle dispersions 230.
Figure 2. Example applications of DNNs to restore 512×512 STEM images from sparse signals. Training as part of a generative adversarial network 197-200 yields more realistic outputs than training a single DNN with mean squared errors. Enlarged 64×64 regions from the top left of each crop are shown to ease comparison. a) Input is a Gaussian blurred 1/20 coverage spiral 201. b) Input is a 1/25 coverage grid 202. This figure is adapted from our earlier works under Creative Commons Attribution 4.0 73 licenses.
In electron microscopy, compressed sensing is especially effective due to high signal redundancy 231.
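As an illustration of the fixed sparse scan paths in figure 2, the sketch below (a NumPy toy example, not the scan generation used in our papers) rasterizes an Archimedes spiral onto a 512×512 grid and uses it to select a partial scan from an image; the number of turns is an arbitrary choice that gives roughly 1/20 px coverage.

import numpy as np

def spiral_mask(size=512, turns=13, samples=500_000):
    """Rasterize an Archimedes spiral, r = a*theta, onto a boolean pixel mask."""
    theta = np.linspace(0.0, 2.0 * np.pi * turns, samples)
    r = (size / 2 - 1) * theta / theta[-1]            # radius grows linearly with angle
    cols = (size / 2 + r * np.cos(theta)).astype(int)
    rows = (size / 2 + r * np.sin(theta)).astype(int)
    mask = np.zeros((size, size), dtype=bool)
    mask[rows, cols] = True
    return mask

mask = spiral_mask()
image = np.random.rand(512, 512)                      # stand-in for a 512x512 STEM image in [0, 1]
partial_scan = np.where(mask, image, 0.0)             # unvisited pixels are zeroed for later infilling
print(f"Coverage: 1/{1.0 / mask.mean():.1f} px")      # approximately 1/20 px for these settings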
For example, most electron microscopy images are sampled at 5-10× their Nyquist rates 232 to ease visual inspection, decrease sub-Nyquist aliasing 233 , and avoid undersampling. Perhaps the most popular approach to compressed sensing is upsampling or infilling a uniformly spaced grid of signals [234] [235] [236] . Interpolation methods include Lancsoz 234 , nearest neighbour 237 , polynomial interpolation 238 , Wiener 239 and other resampling methods [240] [241] [242] . However, a variety of other strategies to minimize STEM beam damage have also been proposed, including dose fractionation 243 and a variety of sparse data collection methods 244 . Perhaps the most intensively investigated approach to the latter is sampling a random subset of pixels, followed by reconstruction using an inpainting algorithm [244] [245] [246] [247] [248] [249] . Random sampling of pixels is nearly optimal for reconstruction by compressed sensing algorithms 250 . However, random sampling exceeds the design parameters of standard electron beam deflection systems, and can only be performed by collecting data slowly 138, 251 , or with the addition of a fast deflection or blanking system 247, 252 . Sparse data collection methods that are more compatible with conventional STEM electron beam deflection systems have also been investigated. For example, maintaining a linear fast scan deflection whilst using a widely-spaced slow scan axis with some small random 'jitter' 245, 251 . However, even small jumps in electron beam position can lead to a significant difference between nominal and actual beam positions in a fast scan. Such jumps can be avoided by driving functions with continuous derivatives, such as those for spiral and Lissajous scan paths 138, 201, 247, 253, 254 . Sang 138, 254 considered a variety of scans including Archimedes and Fermat spirals, and scans with constant angular or linear displacements, by driving electron beam deflectors with a field-programmable gate array 255 (FPGA) based system 138 . Spirals with constant angular velocity place the least demand on electron beam deflectors. However, dwell times, and therefore electron dose, decreases with radius. Conversely, spirals created with constant spatial speeds are prone to systematic image distortions due to lags in deflector responses. In practice, fixed doses are preferable as they simplify visual inspection and limit the dose dependence of STEM noise 129 . Deep learning can leverage an understanding of physics to infill images [256] [257] [258] . Example applications include increasing scanning electron microscopy 178, 259, 260 (SEM), STEM 202, 261 and TEM 262 resolution, and infilling continuous sparse scans 201 . Example applications of DNNs to complete sparse spiral and grid scans are shown in figure 2 . However, caution should be used when infilling large regions as ANNs may generate artefacts if a signal is unpredictable 201 . A popular alternative to deep learning for infilling large regions is exemplar-based infilling [263] [264] [265] [266] . However, exemplar-based infilling often leaves artefacts 267 4 and is usually limited to leveraging information from single images. Smaller regions are often infilled by fast marching 268 , Navier-Stokes infilling 269 , or interpolation 238 . Deep learning has been the basis of state-of-the-art classification [270] [271] [272] [273] since convolutional neural networks (CNNs) enabled a breakthrough in classification accuracy on ImageNet 71 . 
Most classifiers are single feedforward neural networks (FNNs) that learn to predict discrete labels. In electron microscopy, applications include classifying image region quality 274, 275 , material structures 276, 277 , and image resolution 278 . However, siamese [279] [280] [281] and dynamically parameterized 282 networks can more quickly learn to recognise images. Finally, labelling ANNs can learn to predict continuous features, such as mechanical properties 283 . Labelling ANNs are often combined with other methods. For example, ANNs can be used to automatically identify particle locations 186, [284] [285] [286] to ease subsequent processing. Semantic segmentation is the classification of pixels into discrete categories. In electron microscopy, applications include the automatic identification of local features 288, 289 , such as defects 290, 291 , dopants 292 , material phases 293 , material structures 294, 295 , dynamic surface phenomena 296 , and chemical phases in nanoparticles 297 . Early approaches to semantic segmentation used simple rules. However, such methods were not robust to a high variety of data 298 . Subsequently, more adaptive algorithms based on soft-computing 299 and fuzzy algorithms 300 were developed to use geometric shapes as priors. However, these methods were limited by programmed features and struggled to handle the high variety of data. To improve performance, DNNs have been trained to semantically segment images [301] [302] [303] [304] [305] [306] [307] [308] . Semantic segmentation DNNs have been developed for focused ion beam scanning electron microscopy 309-311 (FIB-SEM), SEM 311-314 , STEM 287, 315 , and TEM 286, 310, 311, [316] [317] [318] [319] . For example, applications of a DNN to semantic segmentation of STEM images of steel are shown in figure 3 . Deep learning based semantic segmentation also has a high variety of applications outside of electron microscopy, including autonomous driving 320-324 , dietary monitoring 325, 326 , magnetic resonance images 327-331 , medical images 332-334 such as prenatal ultrasound [335] [336] [337] [338] , and satellite image translation [339] [340] [341] [342] [343] . Most DNNs for semantic segmentation are trained with images segmented by humans. However, human labelling may be too expensive, time-consuming, or inappropriate for sensitive data. Unsupervised semantic segmentation can avoid these difficulties by learning to segment images from an additional dataset of segmented images 344 or image-level labels [345] [346] [347] [348] . However, unsupervised semantic segmentation networks are often less accurate than supervised networks. Electrons exhibit wave-particle duality 350, 351 , so electron propagation is often described by wave optics 352 . Applications of electron wavefunctions exiting materials 353 include determining projected potentials and corresponding crystal structure information 354, 355 , information storage, point spread function deconvolution, improving contrast, aberration correction 356 , thickness measurement 357 , and electric and magnetic structure determination 358, 359 . Usually, exit wavefunctions are either iteratively reconstructed from focal series 360-364 or recorded by electron holography 352, 363, 365 . However, iterative reconstruction is often too slow for live applications, and holography is sensitive to distortions and may require expensive microscope modification. 
Non-iterative methods based on DNNs have been developed to reconstruct optical exit wavefunctions from focal series 69 or single images [366] [367] [368] . Subsequently, DNNs have been developed to reconstruct exit wavefunctions from single TEM images 349 , as shown in figure 4 . Indeed, deep learning is increasingly being applied to accelerated quantum mechanics [369] [370] [371] [372] [373] [374] . Other examples of DNNs adding new dimensions to data include semantic segmentation described in section 1.4, and reconstructing 3D atomic distortions from 2D images 375 . Non-iterative methods that do not use ANNs to recover phase information from single images have also been developed 376, 377 . However, they are limited to defocused images in the Fresnel regime 376 , or to non-planar incident wavefunctions in the Fraunhofer regime 377 . 5/98 6 Access to scientific resources is essential to scientific enterprise 378 . Fortunately, most resources needed to get started with machine learning are freely available. This section provides directions to various machine learning resources, including how to access deep learning frameworks, a free GPU or tensor processing unit (TPU) to accelerate tensor computations, platforms that host datasets and source code, and pretrained models. To support the ideals of open science embodied by Plan S 378-380 , we focus on resources that enhance collaboration and enable open access 381 . We also discuss how electron microscopes can interface with ANNs and the importance of machine learning resources in the context of electron microscopy. However, we expect that our insights into electron microscopy can be generalized to other scientific fields. A DNN is an ANN with multiple layers that perform a sequence of tensor operations. Tensors can either be computed on central processing units (CPUs) or hardware accelerators 62 , such as FPGAs 382-385 , GPUs 386-388 , and TPUs 389-391 . Most benchmarks indicate that GPUs and TPUs outperform CPUs for typical DNNs that could be used for image processing 392-396 in electron microscopy. However, GPU and CPU performance can be comparable when CPU computation is optimized 397 . TPUs often outperform GPUs 394 , and FPGAs can outperform GPUs 398, 399 if FPGAs have sufficient arithmetic units 400, 401 . Typical power consumption per TFLOPS 402 decreases in order CPU, GPU, FPGA, then TPU, so hardware acceleration can help to minimize long-term costs and environmental damage 403 . For beginners, Google Colab 404-407 and Kaggle 408 provide hardware accelerators in ready-to-go deep learning environments. Free compute time on these platforms is limited as they are not intended for industrial applications. Nevertheless, the free compute time is sufficient for some research 409 . For more intensive applications, it may be necessary to get permanent access to hardware accelerators. If so, many online guides detail how to install 410 (WEKA) and ZeroCostDL4Mic 501 . The GUIs offer less functionality and scope for customization than programming interfaces. However, GUI-based DLFs are rapidly improving. Moreover, existing GUI functionality is more than sufficient to implement popular FNNs, such as image classifiers 272 and encoder-decoders 305-308, 502-504 . Training ANNs is often time-consuming and computationally expensive 403 . 
Fortunately, pretrained models are available from a range of open access collections 505 , such as Model Zoo 506 , Open Neural Network Exchange 507-510 (ONNX) Model Zoo 511 , TensorFlow Hub 512, 513 , and TensorFlow Model Garden 514 . Some researchers also provide pretrained models via project repositories 70, 201, 202, 231, 349 . Pretrained models can be used immediately or to transfer learning 515-521 to new applications. For example, by fine-tuning and augmenting the final layer of a pretrained model 522 . Benefits of transfer learning can include decreasing training time by orders of magnitude, reducing training data requirements, and improving generalization 520, 523 . Using pretrained models is complicated by ANNs being developed with a variety of DLFs in a range of programming languages. However, most DLFs support interoperability. For example, by supporting the saving of models to a common format or to formats that are interoperable with the Neural Network Exchange Format 524 (NNEF) or ONNX formats. Many DLFs also support saving models to HDF5 525, 526 , which is popular in the pycroscopy 527, 528 and HyperSpy 529, 530 libraries used by electron microscopists. The main limitation of interoperability is that different DLFs may not support the same functionality. For example, Dlib 431, 432 does not support recurrent neural networks 531-536 (RNNs). Randomly initialized ANNs 537 must be trained, validated, and tested with large, carefully partitioned datasets to ensure that they are robust to general use 538 . Most ANN training starts from random initialization, rather than transfer learning 515-521 , as: 1. Researchers may be investigating modifications to ANN architecture or ability to learn. 2. Pretrained models may be unavailable or too difficult to find. 3. Models may quickly achieve sufficient performance from random initialization. For example, training an encoder-decoder based on Xception 539 to improve electron micrograph signal-to-noise 70 To achieve high performance, it may be necessary to curate a large dataset for ANN training 2 . However, large datasets like DeepMind Kinetics 602 , ImageNet 565 , and YouTube 8M 603 may take a team months to prepare. As a result, it may not be practical to divert sufficient staff and resources to curate a high-quality dataset, even if curation is partially automated [603] [604] [605] [606] [607] [608] [609] [610] . To curate data, human capital can be temporarily and cheaply increased by using microjob services 611 . For example, through microjob platforms tabulated in table 2. Increasingly, platforms are emerging that specialize in data preparation for machine Table 2 . Microjob service platforms. The size of typical tasks varies for different platforms and some platforms specialize in preparing machine learning datasets. learning. Nevertheless, microjob services may be inappropriate for sensitive data or tasks that require substantial domain-specific knowledge. Software is part of our cultural, industrial, and scientific heritage 612 624 . These platforms enhance collaboration with functionality that helps users to watch 625 and contribute improvements [626] [627] [628] [629] [630] [631] [632] to source code. The choice of platform is often not immediately important for small electron microscopy projects as most platforms offer similar functionality. Nevertheless, functionality comparisons of open source platforms are available [633] [634] [635] . 
For beginners, we recommend GitHub as it is actively developed, scalable to large projects and has an easy-to-use interface. Most web traffic 636, 637 goes to large-scale web search engines 638-642 such as Bing, DuckDuckGo, Google, and Yahoo. This includes searches for scholarly content [643] [644] [645] . We recommend Google for electron microscopy queries as it appears to yield the best results for general [646] [647] [648] , scholarly 644, 645 and other 649 queries. However, general search engines can be outperformed by dedicated search engines for specialized applications. For example, for finding academic literature [650] [651] [652] , data 653 , jobs 654, 655 , publication venues 656 , patents 657-660 , people [661] [662] [663] , and many other resources. The use of search engines is increasingly political 664-666 as they influence which information people see. However, most users appear to be satisfied with their performance 667 . Introductory textbooks are outdated 668, 669 insofar that most information is readily available online. We find that some websites are frequent references for up-to-date and practical information: 7. Distill 686 is a journal dedicated to providing clear explanations about machine learning. Monetary prizes are awarded for excellent communication and refinement of ideas. This list enumerates popular resources that we find useful, so it may introduce personal bias. However, alternative guides to useful resources are available 687-689 . We find that the most common issues finding information are part of an ongoing reproducibility crisis 690, 691 where machine learning researchers do not publish their source code or data. Nevertheless, third party source code is sometimes available. Alternatively, ANNs can reconstruct source code from some research papers 692 . The number of articles published per year in reputable peer-reviewed 693-697 scientific journals 698, 699 has roughly doubled every nine years since the beginning of modern science 700 . There are now over 25000 peer-reviewed journals 699 with varying impact factors [701] [702] [703] , scopes and editorial policies. Strategies to find the best journal to publish in include using online journal finders 704 , seeking the advice of learned colleagues, and considering where similar research has been published. Increasingly, working papers are also being published in open access preprint archives [705] [706] [707] . For example, the arXiv 708, 709 is a popular preprint archive for computer science, mathematics, and physics. Advantages of preprints include ensuring that research is openly available, increasing discovery and citations [710] [711] [712] [713] [714] , inviting timely scientific discussion, and raising awareness to reduce unnecessary duplication of research. Many publishers have adapted to the popularity of preprints 705 by offering open access publication options [715] [716] [717] [718] and allowing, and in some cases encouraging 719 , the prior publication of preprints. Indeed, some journals are now using the arXiv to host their publications 720 . A variety of software can help authors prepare scientific manuscripts 721 . However, we think the most essential software is a document preparation system. Most manuscripts are prepared with Microsoft Word 722 or similar software 723 . However, Latex 724-726 is a popular alternative among computer scientists, mathematicians and physicists 727 . Most electron microscopists at the University of Warwick appear to prefer Word. 
A 2014 comparison of Latex and Word found that Word is better at all tasks other than typesetting equations 728 . However, in 2017 it become possible to use Latex to typeset equations within Word 727 . As a result, Word appears to be more efficient than Latex for most manuscript preparation. Nevertheless, Latex may still be preferable to authors who want fine control over typesetting 729, 730 . As a compromise, we use Overleaf 731 to edit Latex source code, then copy our code to Word as part of proofreading to identify issues with grammar and wording. An electron microscope is an instrument that uses electrons as a source of illumination to enable the study of small objects. Electron microscopy competes with a large range of alternative techniques for material analysis [732] [733] [734] , including atomic force microscopy [735] [736] [737] (AFM); Fourier transformed infrared (FTIR) spectroscopy 738, 739 ; nuclear magnetic resonance [740] [741] [742] [743] (NMR); Raman spectroscopy [744] [745] [746] [747] [748] [749] [750] ; and x-ray diffraction 751, 752 (XRD), dispersion 753 , fluorescence 754, 755 (XRF), and photoelectron spectroscopy 756, 757 (XPS) . Quantitative advantages of electron microscopes can include higher resolution and depth of field, and lower radiation damage than light microscopes 758 . In addition, electron microscopes can record images, enabling visual interpretation of complex structures that may otherwise be intractable. This section will briefly introduce varieties of electron microscopes, simulation software, and how electron microscopes can interface with ANNs. Figure 6 . Numbers of results per year returned by Dimensions.ai abstract searches for SEM, TEM, STEM, STM and REM qualitate their popularities. The number of results for 2020 is extrapolated using the mean rate before 14th July 2020. There are a variety of electron microscopes that use different illumination mechanisms. For example, reflection electron microscopy 759, 760 (REM), scanning electron microscopy 761, 762 (SEM), scanning transmission electron microscopy 763, 764 (STEM), scanning tunnelling microscopy 765, 766 (STM), and transmission electron microscopy [767] [768] [769] . To roughly gauge popularities of electron microscope varieties, we performed abstract searches with Dimenions.ai 651, 770-772 for their abbreviations followed by "electron microscopy" e.g. "REM electron microscopy". Numbers of results per year in figure 6 qualitate that popularity increases in order REM, STM, STEM, TEM, then SEM. It may be tempting to attribute the popularity of SEM over TEM to the lower cost of SEM 773 , which increases accessibility. However, a range of considerations influence the procurement of electron microscopes 774 and hourly pricing at universities 775-779 is similar for SEM and TEM. In SEM, material surfaces are scanned by sequential probing with a beam of electrons, which are typically accelerated to 0.2-40 keV. The SEM detects quanta emitted from where the beam interacts with the sample. Most SEM imaging uses low-energy secondary electrons. However, reflection electron microscopy 759, 760 (REM) uses elastically backscattered electrons and is often complimented by a combination of reflection high-energy electron diffraction [780] [781] [782] (RHEED), reflection highenergy electron loss spectroscopy 783, 784 (RHEELS) and spin-polarized low-energy electron microscopy [785] [786] [787] . Some SEMs also detect Auger electrons 788, 789 . To enhance materials characterization, most SEMs also detect light. 
The most common light detectors are for cathodoluminescence and energy dispersive x-ray 790, 791 (EDX) spectroscopy. Nonetheless, some SEMs also detect Bremsstrahlung radiation 792. Alternatively, TEM and STEM detect electrons transmitted through specimens. In conventional TEM, a single region is exposed to a broad electron beam. In contrast, STEM uses a fine electron beam to probe a series of discrete locations. Typically, electrons are accelerated across a potential difference to kinetic energies, E_k, of 80-300 keV. Electrons also have rest energy E_e = m_e c^2, where m_e is electron rest mass and c is the speed of light. The total energy, E_t = E_e + E_k, of free electrons is related to their rest mass energy by a Lorentz factor, γ = E_t/E_e = (1 - v^2/c^2)^(-1/2), where v is the speed of electron propagation in the rest frame of an electron microscope. Electron kinetic energies in TEM and STEM are comparable to their rest energy, E_e = 511 keV 793, so relativistic phenomena 794, 795 must be considered to accurately describe their dynamics. Electrons exhibit wave-particle duality 350, 351. Thus, in an ideal electron microscope, the maximum possible detection angle, θ, between two point sources separated by a distance, d, perpendicular to the electron propagation direction is diffraction-limited. The resolution limit for imaging can be quantified by Rayleigh's criterion 796-798, sin θ = 1.22λ/d, where resolution increases with decreasing wavelength, λ. Electron wavelength decreases with increasing accelerating voltage, as described by the relativistic de Broglie relation 799-801, λ = hc/(E_k(E_k + 2E_e))^(1/2), where h is Planck's constant 793. Electron wavelengths for typical acceleration voltages tabulated by JEOL are in picometres 802 (a short numerical sketch is included below). In comparison, Cu K-α x-rays, which are often used for XRD, have wavelengths near 0.15 nm 803. In theory, electrons can therefore achieve over 100× higher resolution than x-rays. Electrons and x-rays are both ionizing; however, electrons often do less radiation damage to thin specimens than x-rays 758. Tangentially, TEM and STEM often achieve over 10 times higher resolution than SEM 804 as transmitted electrons in TEM and STEM are easier to resolve than electrons returned from material surfaces in SEM. In practice, TEM and STEM are also limited by incoherence 805-807 introduced by inelastic scattering, electron energy spread, and other mechanisms. TEM and STEM are related by an extension of Helmholtz reciprocity 808, 809 where the source plane in a TEM corresponds to the detector plane in a STEM 810, as shown in figure 5. Consequently, TEM coherence is limited by electron optics between the specimen and image, whereas STEM coherence is limited by the illumination system. For conventional TEM and STEM imaging, electrons are normally incident on a specimen 811. Advantages of STEM imaging can include higher contrast and resolution than TEM imaging, and lower radiation damage 812. As a result, STEM is increasingly being favoured over TEM for high-resolution studies. However, we caution that definitions of TEM and STEM resolution can be disparate 813. In addition to conventional imaging, TEM and STEM include a variety of operating modes for different applications. For example, TEM operating configurations include electron diffraction 814; convergent beam electron diffraction 815-817 (CBED); tomography 818-826; and bright field 768, 827-829, dark field 768, 829 and annular dark field 830 imaging.
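As the numerical sketch promised above, a minimal calculation (using CODATA values from scipy.constants) of the relativistic de Broglie wavelength at typical accelerating voltages:

from scipy.constants import c, electron_mass, elementary_charge, h

def electron_wavelength(voltage):
    """Relativistic de Broglie wavelength (m) for an accelerating voltage (V)."""
    E_k = elementary_charge * voltage   # kinetic energy in joules
    E_e = electron_mass * c**2          # rest energy, about 511 keV, in joules
    return h * c / (E_k * (E_k + 2.0 * E_e))**0.5

for kilovolts in (80, 200, 300):
    print(f"{kilovolts} kV: {1e12 * electron_wavelength(1e3 * kilovolts):.2f} pm")
# prints approximately 4.18 pm, 2.51 pm and 1.97 pm, in line with tabulated values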
Similarly, STEM operating configurations include differential phase contrast 831-834; tomography 818, 820, 822, 823; and bright field 835, 836 or dark field 837 imaging. Further, electron cameras 838, 839 are often supplemented by secondary signal detectors. For example, elemental composition is often mapped by EDX spectroscopy, electron energy loss spectroscopy 840, 841 (EELS) or wavelength dispersive spectroscopy 842, 843 (WDS). Similarly, electron backscatter diffraction 844-846 (EBSD) can detect strain 847-849 and crystallization 850-852. The propagation of electron wavefunctions through electron microscopes can be described by wave optics 136. Accordingly, the most popular approach to modelling measurement contrast is multislice simulation 853, 854, where an electron wavefunction is iteratively perturbed as it travels through a model of a specimen. Multislice software for electron microscopy includes ACEM 854 and a variety of other packages 879-884. We find that most multislice software is a recreation and slight modification of common functionality, possibly due to a publish-or-perish culture in academia 885-887. Bloch-wave simulation 854, 888-892 is an alternative to multislice simulation that can reduce computation time and memory requirements for crystalline materials 893. Most modern electron microscopes support Gatan Microscopy Suite (GMS) Software 894. GMS enables electron microscopes to be programmed by DigitalMicrograph Scripting, a proprietary Gatan programming language akin to a simplified version of C++. A variety of DigitalMicrograph scripts, tutorials and related resources are available from Dave Mitchell's DigitalMicrograph Scripting Website 679, 680, FELMI/ZFE's Script Database 895 and Gatan's Script library 896. Some electron microscopists also provide DigitalMicrograph scripting resources on their webpages 897-899. However, DigitalMicrograph scripts are slow insofar as they are interpreted at runtime, and there is limited native functionality for parallel and distributed computing. As a result, extensions to DigitalMicrograph scripting are often developed in other programming languages that offer more functionality. Historically, most extensions were developed in C++ 900. This was problematic as there is limited documentation, the standard approach used outdated C++ software development kits such as Visual Studio 2008, and the programming expertise required to create functions that interface with DigitalMicrograph scripts limited accessibility. To increase accessibility, recent versions of GMS now support python 901. This is convenient as it enables ANNs developed with python to readily interface with electron microscopes. For ANNs developed with C++, users have the option to create C++ bindings for either DigitalMicrograph script or python. Integrating ANNs developed in other programming languages is more complicated as DigitalMicrograph provides almost no support. However, that complexity can be avoided by exchanging files from DigitalMicrograph script to external libraries via a random access memory (RAM) disk 902 or secondary storage 903. Further increasing accessibility, there are collections of GMS plugins with GUIs for automation and analysis 897-899, 904. In addition, various individual plugins are available 905-909. Some plugins are open source, so they can be adapted to interface with ANNs. However, many high-quality plugins are proprietary and closed source, limiting their use to automation of data collection and processing.
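To sketch the file-exchange pattern mentioned above, the loop below polls a shared folder for images written by a DigitalMicrograph script, applies a placeholder model, and writes results back. The folder path, file naming and .npy format are assumptions for illustration; in practice the DigitalMicrograph side might write raw binary that NumPy reads with a known shape and dtype.

import time
from pathlib import Path

import numpy as np

exchange_dir = Path("R:/dm_exchange")          # hypothetical RAM disk shared with GMS
model = lambda x: x                            # placeholder for an ANN, e.g. a TensorFlow model

while True:
    for request in sorted(exchange_dir.glob("*_in.npy")):
        image = np.load(request)               # image written by the DigitalMicrograph script
        result = model(image)
        np.save(request.with_name(request.name.replace("_in", "_out")), result)
        request.unlink()                       # delete the request so it is only processed once
    time.sleep(0.1)                            # poll at about 10 Hz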
Plugins can also be supplemented by a variety of libraries and interfaces for electron microscopy signal processing. For example, popular general-purpose software includes ImageJ 910, pycroscopy 527, 528 and HyperSpy 529, 530. In addition, there are directories for tens of general-purpose and specific electron microscopy programs 911-913. Most modern ANNs are configured from a variety of DLF components. To take advantage of hardware accelerators 62, most ANNs are implemented as sequences of parallelizable layers of tensor operations 914. Layers are often parallelized across data and may be parallelized across other dimensions 915. This section introduces popular nonlinear activation functions, normalization layers, convolutional layers, and skip connections. To add insight, we provide comparative discussion and address some common causes of confusion. In general, DNNs need multiple layers to be universal approximators 37-45. Nonlinear activation functions 916, 917 are therefore essential to DNNs as successive linear layers can be contracted to a single layer. Activation functions separate artificial neurons, similar to biological neurons 918. To learn efficiently, most DNNs are tens or hundreds of layers deep 47, 919-921. High depth increases representational capacity 47, which can help training by gradient descent as DNNs evolve as linear models 922 and nonlinearities can create suboptimal local minima where data cannot be fit by linear models 923. There are infinitely many possible activation functions. However, most activation functions have low polynomial order, similar to physical Hamiltonians 47. Most ANNs developed for electron microscopy are for image processing, where the most popular nonlinearities are rectified linear units 924, 925 (ReLUs). The ReLU activation, f(x), of an input, x, is f(x) = max(0, x), and its gradient is 1 for x > 0 and 0 otherwise. Popular variants of ReLUs include leaky ReLU 926, f(x) = max(αx, x), where α is a hyperparameter; parametric ReLU 22 (PReLU), where α is a learned parameter; dynamic ReLU, where α is a learned function of inputs 927; and randomized leaky ReLU 928 (RReLU), where α is chosen randomly. Typically, learned PReLU α are higher the nearer a layer is to ANN inputs 22. Motivated by limited comparisons that do not show a clear performance difference between ReLU and leaky ReLU 929, some blogs 930 argue against using leaky ReLU due to its higher computational requirements and complexity. However, an in-depth comparison found that leaky ReLU variants consistently slightly outperform ReLU 928. In addition, the non-zero gradient of leaky ReLU for x ≤ 0 prevents saturating, or "dying", ReLU 931-933, where the zero gradient of ReLUs stops learning. There are a variety of other piecewise linear ReLU variants that can improve performance. For example, ReLU_h activations are limited to a threshold 934, h, so that f(x) = min(max(0, x), h). Thresholds near h = 6 are often effective, so a popular choice is ReLU6. Another popular activation is concatenated ReLU 935 (CReLU), which is the concatenation of ReLU(x) and ReLU(−x). Other ReLU variants include adaptive convolutional 936, bipolar 937, elastic 938, and Lipschitz 939 ReLUs. However, most ReLU variants are uncommon as they are more complicated than ReLU and offer small, inconsistent, or unclear performance gains. Moreover, it follows from the universal approximator theorems 37-45 that disparity between ReLU and its variants approaches zero as network depth increases.
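For concreteness, NumPy sketches of the piecewise linear activations above (the α and h defaults are common choices, not prescriptions):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):                 # alpha is fixed; PReLU learns it instead
    return np.where(x > 0, x, alpha * x)

def relu_h(x, h=6.0):                          # ReLU6 when h = 6
    return np.minimum(np.maximum(0.0, x), h)

def crelu(x, axis=-1):                         # concatenation doubles the channel count
    return np.concatenate([relu(x), relu(-x)], axis=axis)

x = np.linspace(-8.0, 8.0, 9)
print(relu(x), leaky_relu(x), relu_h(x), crelu(x), sep="\n")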
In shallow networks, curved activation functions with non-zero Hessians often accelerate convergence and improve performance. A popular activation is the exponential linear unit 940 (ELU), where α is a learned parameter. Further, a scaled ELU 941 (SELU), Sigmoids can also be applied to limit the support of outputs. Unscaled, or "logistic", sigmoids are often denoted σ (x) and are related to tanh by tanh(x) = 2σ (2x) − 1. To avoid expensive exp(−x) in the computation of tanh, we recommend K-tanH 950 , LeCun tanh 951 , or piecewise linear approximation 952, 953 . The activation functions introduced so far are scalar functions than can be efficiently computed in parallel for each input element. However, functions of vectors, x = {x 1 , x 2 , ...}, are also popular. For example, softmax activation 954 , is often applied before computing cross-entropy losses for classification networks. Similarly, Ln vector normalization, is often applied to n-dimensional vectors to ensure that they lie on a unit n-sphere 349 . Finally, max pooling 955, 956 , is another popular multivariate activation function that is often used for downsampling. However, max pooling has fallen out of favour as it is often outperformed by strided convolutional layers 957 . Other vector activation functions include squashing nonlinearities for dynamic routing by agreement in capsule networks 958 and cosine similarity 959 . There is a range of other activation functions that are not detailed here for brevity. Further, finding new activation functions is an active area of research 960, 961 . Notable variants include choosing activation functions from a set before training 962, 963 and learning activation functions 962, 964-967 . Activation functions can also encode probability distributions 968-970 or include noise 953 . Finally, there are a variety of other deterministic activation functions 961, 971 . In electron microscopy, most ANNs enable new or enhance existing applications. Subsequently, we recommend using computationally efficient and established activation functions unless there is a compelling reason to use a specialized activation function. Normalization 972-974 standardizes signals, which can accelerate convergence by gradient descent and improve performance. Batch normalization 975-980 is the most popular normalization layer in image processing DNNs trained with minibatches of N examples. Technically, a "batch" is an entire training dataset and a "minibatch" is a subset; however, the "mini" is often omitted where meaning is clear from context. During training, batch normalization applies a transform, where x = {x 1 , ..., x N } is a batch of layer inputs, γ and β are a learnable scale and shift, and ε is a small constant added for numerical stability. During inference, batch normalization applies a transform, Increasing batch size stabilizes learning by averaging destabilizing loss spikes over batches 261 . Batched learning also enables more efficient utilization of modern hardware accelerators. For example, larger batch sizes improve utilization of GPU memory bandwidth and throughput 391, 981, 982 . Using large batches can also be more efficient than many small batches when distributing training across multiple CPU clusters or GPUs due to communication overheads. However, the performance benefits of large batch sizes can come at the cost of lower test accuracy as training with large batches tends to converge to sharper minima 983, 984 . 
As a result, it often best not to use batch sizes higher than N ≈ 32 for image classification 985 . However, learning rate scaling 541 and layer-wise adaptive learning rates 986 can increase accuracy of training with fixed larger batch sizes. Batch size can also be increased throughout training without compromising accuracy 987 to exploit effective learning rates being inversely proportional to batch size 541, 987 . Alternatively, accuracy can be improved by creating larger batches from replicated instances of training inputs with different data augmentations 988 . There are a few caveats to batch normalization. Originally, batch normalization was applied before activation 976 . However, applying batch normalization after activation often slightly improves performance 989, 990 . In addition, training can be sensitive to the often-forgotten ε hyperparameter 991 in equation 16 . Typically, performance decreases as ε is increased above ε ≈ 0.001; however, there is a sharp increase in performance around ε = 0.01 on ImageNet. Finally, it is often assumed that batches are representative of the training dataset. This is often approximated by shuffling training data to sample independent and identically distributed (i.i.d.) samples. However, performance can often be improved by prioritizing sampling 992, 993 . We observe that batch normalization is usually effective if batch moments, µ B and σ B , have similar values for every batch. Batch normalization is less effective when training batch sizes are small, or do not consist of independent samples. To improve performance, standard moments in equation 16 can be renormalized 994 to expected means, µ, and standard deviations, where gradients are not backpropagated with respect to (w.r.t.) the renormalization parameters, r and d. Moments, µ and σ are tracked by exponential moving averages and clipping to r max and d max improves learning stability. Usually, clipping values are increased from starting values of r max = 1 and d max = 0, which correspond to batch normalization, as training progresses. Another approach is virtual batch normalization 995 (VBN), which estimates µ and σ from a reference batch of samples and does not require clipping. However, VBN is computationally expensive as it requires computing a second batch of statistics at every training iteration. Finally, online 996 and streaming 974 normalization enable training with small batch sizes by replace µ B and σ B in equation 16 with their exponential moving averages. There are alternatives to the L 2 batch normalization of equations 14-18 that standardize to different Euclidean norms. For example, L 1 batch normalization 997 computes where C L 1 = (π/2) 1/2 . Although the C L 1 factor could be learned by ANN parameters, its inclusion accelerates convergence of the original implementation of L 1 batch normalization 997 . Another alternative is L ∞ batch normalization 997 , which computes where C L ∞ is a scale factor, and top k (x) returns the k highest elements of x. Hoffer et al suggest k = 10 997 . Some L 1 batch normalization proponents claim that L 1 batch normalization outperforms 975 or achieves similar performance 997 to L 2 batch normalization. However, we found that L 1 batch normalization often lowers performance in our experiments. Similarly, L ∞ batch normalization often lowers performance 997 . Overall, L 1 and L ∞ batch normalization do not appear to offer a substantial advantage over L 2 batch normalization. 
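To make the batch normalization transforms described above concrete, the following NumPy sketch standardizes a batch with its moments during training and tracks exponential moving averages of the moments for inference; the momentum and ε values are illustrative defaults rather than values from this thesis.

import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var,
                     eps=1e-3, momentum=0.99):
    """Training-mode batch normalization over the batch axis (axis 0).

    Each feature is standardized with batch moments, then a learnable scale
    (gamma) and shift (beta) are applied. Running moments are tracked by
    exponential moving averages for use at inference."""
    mu = x.mean(axis=0)                    # Batch mean per feature.
    var = x.var(axis=0)                    # Batch variance per feature.
    x_hat = (x - mu) / np.sqrt(var + eps)  # Standardized activations.
    y = gamma * x_hat + beta               # Learnable scale and shift.

    running_mean = momentum * running_mean + (1.0 - momentum) * mu
    running_var = momentum * running_var + (1.0 - momentum) * var
    return y, running_mean, running_var

def batch_norm_infer(x, gamma, beta, running_mean, running_var, eps=1e-3):
    """Inference-mode batch normalization with tracked moments."""
    return gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta

# Example: a batch of 32 examples with 8 features.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 8))
gamma, beta = np.ones(8), np.zeros(8)
y, rm, rv = batch_norm_train(x, gamma, beta, np.zeros(8), np.ones(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # Near 0 and 1.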
A variety of layers normalize samples independently, including layer, instance, and group normalization. They are compared with batch normalization in figure 7. Layer normalization 998, 999 is a transposition of batch normalization that is computed across feature channels for each training example, instead of across batches. Batch normalization is ineffective in RNNs; however, layer normalization of input activations often improves accuracy 998 . Instance normalization 1000 is an extreme version 15/98 of layer normalization that standardizes each feature channel for each training example. Instance normalization was developed for style transfer 1001-1005 and makes ANNs insensitive to input image contrast. Group normalization 1006 is intermediate to instance and layer normalization insofar that it standardizes groups of channels for each training example. The advantages of a set of multiple different normalization layers, Ω, can be combined by switchable normalization 1007, 1008 , which standardizes tô where µ z and σ z are means and standard deviations computed by normalization layer z, and their respective importance ratios, λ µ z and λ σ z , are trainable parameters that are softmax activated to sum to unity. Combining batch and instance normalization statistics outperforms batch normalization for a range of computer vision tasks 1009 . However, most layers strongly weighted either batch or instance normalization, with most preferring batch normalization. Interestingly, combining batch, instance and layer normalization statistics 1007, 1008 results in instance normalization being preferred in earlier layers, whereas layer normalization was preferred in the later layers, and batch normalization was preferred in the middle layers. Smaller batch sizes lead to a preference towards layer normalization and instance normalization. Limitingly, using multiple normalization layers increases computation. To limit expense, we therefore recommend either defaulting to batch normalization, or progressively using single instance, batch or layer normalization layers. A significant limitation of batch normalization is that it is not effective in RNNs. This is a limited issue as most electron microscopists are developing CNNs for image processing. However, we anticipate that RNNs may become more popular in electron microscopy following the increasing popularity of reinforcement learning 1010 . In addition to general-purpose alternatives to batch normalization that are effective in RNNs, such as layer normalization, there are a variety of dedicated normalization schemes. For example, recurrent batch normalization 1011, 1012 uses distinct normalization layers for each time step. Alternatively, batch normalized RNNs 1013 only have normalization layers between their input and hidden states. Finally, online 996 and streaming 974 normalization are general-purpose solutions that improve the performance of batch normalization in RNNs by applying batch normalization based on a stream of past batch statistics. Normalization can also standardize trainable weights, w. For example, weight normalization 1014 , decouples the L2 norm, g, of a variable from its direction. Similarly, weight standardization 1015 subtracts means from variables and divides them by their standard deviations, similar to batch normalization. Weight normalization often outperforms batch normalization at small batch sizes. However, batch normalization consistently outperforms weight normalization at larger batch sizes used in practice 1016 . 
Combining weight normalization with running mean-only batch normalization can accelerate convergence 1014 . However, similar final accuracy can be achieved without mean-only batch normalization at the cost of slower convergence, or with the use of zero-mean preserving activation functions 937, 997 . To achieve similar performance to batch normalization, norm-bounded weight normalization 997 can be applied to DNNs with scale-invariant activation functions, such as ReLU. Norm-bounded weight normalization fixes g at initialization to avoid learning instability 997, 1016 , and scales outputs with the final DNN layer. Limitedly, weight normalization encourages the use of a small number of features to inform activations 1017 . To maximize feature utilization, spectral normalization 1017 , divides tensors by their spectral norms, σ (w). Further, spectral normalization limits Lipschitz constants 1018 , which often improves generative adversarial network [197] [198] [199] [200] (GAN) training by bounding backpropagated discriminator gradients 1017 . The spectral norm of v is the maximum value of a diagonal matrix, Σ, in the singular value decomposition [1019] [1020] [1021] [1022] where U and V are orthogonal matrices of orthonormal eigenvectors for vv T and v T v, respectively. To minimize computation, σ (w) is often approximated by the power iteration method 1023, 1024 , u ← wv where one iteration of equations 31-32 per training iteration is usually sufficient. Parameter normalization can complement or be combined with signal normalization. For example, scale normalization 1025 , learns scales, g, for activations, and is often combined with weight normalization 1014, 1026 in transformer networks. Similarly, cosine normalization 959 , computes products of L2 normalized parameters and signals. Both scale and cosine normalization can outperform batch normalization. A convolutional neural network [1027] [1028] [1029] [1030] (CNN) is trained to weight convolutional kernels to exploit local correlations, such as spatial correlations in electron micrographs 231 In general, the convolution of two functions, f and g, is and their cross-correlation is where integrals have unlimited support, Ω. In a CNN, convolutional layers sum convolutions of feature channels with trainable kernels, as shown in figure 8 . Thus, f and g are discrete functions and the integrals in equations 36-37 can be replaced with limited summations. Since cross-correlation is equivalent to convolution if the kernel is flipped in every dimension, and CNN kernels are usually trainable, convolution and cross-correlation is often interchangeable in deep learning. For example, a TensorFlow function named "tf.nn.convolution" computes cross-correlations 1056 . Nevertheless, the difference between convolution and cross-correlation can be source of subtle errors if convolutional layers from a DLF are used in an image processing pipeline with static asymmetric kernels. Alternatives to Gaussian kernels for image smoothing 1058 include mean, median and bilateral filters. Sobel kernels compute horizontal and vertical spatial gradients that can be used for edge detection 1059 . For example, 3×3 Sobel kernels are Alternatives to Sobel kernels offer similar utility, and include extended Sobel 1060 , Scharr 1061, 1062 , Kayyali 1063 , Roberts cross 1064 and Prewitt 1065 kernels. 
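To illustrate the distinction between convolution and cross-correlation noted above, the following SciPy sketch applies a 3×3 horizontal Sobel kernel both ways; the two results only agree if the kernel is flipped in every dimension, which is why static asymmetric kernels can cause subtle errors when used with deep learning framework layers that actually compute cross-correlations.

import numpy as np
from scipy.ndimage import convolve, correlate

# 3x3 horizontal Sobel kernel (an asymmetric kernel).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# A small test image with a vertical edge.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

conv = convolve(image, sobel_x)    # Convolution flips the kernel.
xcorr = correlate(image, sobel_x)  # Cross-correlation does not.

# Cross-correlation with a kernel equals convolution with the kernel
# flipped in every dimension.
assert np.allclose(xcorr, convolve(image, sobel_x[::-1, ::-1]))
print(conv)
print(xcorr)  # Same magnitudes, opposite sign for this antisymmetric kernel.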
Two-dimensional Gaussian and Sobel kernels are examples of linearly separable, or "flattenable", kernels, which can be split into two one-dimensional kernels, as shown in equations 38-39b. Kernel separation can decrease computation in convolutional layers by convolving separated kernels in series, and CNNs that only use separable convolutions are effective 1066-1068. However, serial convolutions decrease parallelization and separable kernels have fewer degrees of freedom, decreasing representational capacity. Following, separated kernels are usually at least 5×5, and separated 3×3 kernels are unusual. Even-sized kernels, such as 2×2 and 4×4, are rare as symmetric padding is needed to avoid information erosion caused by spatial shifts of feature maps 1069. A traditional 2D convolutional layer maps inputs, x^input, with height H, width W, and depth D, to outputs, x_k^output = b_k + ∑_{d=1}^{D} w_{k,d} ∗ x_d^input, where the kth of K output channels, k ∈ [1, K], is the sum of a bias, b_k, and convolutions of each input channel with M × N kernels with weights, w_{k,d}. For clarity, a traditional convolutional layer is visualized in figure 8a. Convolutional layers for 1D, 3D and higher-dimensional kernels 1070 are similar. In contrast, fully connected layers connect every input element to every output element. Convolutional layers reduce computation by making local connections within receptive fields of convolutional kernels, and by convolving kernels rather than using different weights at each input position. Intermediately, fully connected layers can be regularized to learn local connections 1085. Fully connected layers are sometimes used at the middle of encoder-decoders 1086. However, such fully connected layers can often be replaced by multiscale atrous, or "holey", convolutions 955 in an atrous spatial pyramid pooling 305, 306 (ASPP) module to decrease computation without a significant decrease in performance. Alternatively, weights in fully connected layers can be decomposed into multiple smaller tensors to decrease computation without significantly decreasing performance 1087, 1088. Convolutional layers can perform a variety of convolutional arithmetic 955. For example, strided convolutions 1089 usually skip computation of outputs that are not at multiples of an integer spatial stride. Most strided convolutional layers are applied throughout CNNs to sequentially decrease spatial extent, and thereby decrease computational requirements. In addition, strided convolutions are often applied at the start of CNNs 539, 1074-1076, where most input features can be resolved at a lower resolution than the input. For simplicity and computational efficiency, stride is typically constant within a convolutional layer; however, increasing stride away from the centre of layers can improve performance 1090. To increase spatial resolution, convolutional layers often use reciprocals of integer strides 1091. Alternatively, spatial resolution can be increased by combining interpolative upsampling with an unstrided convolutional layer 1092, 1093, which can help to minimize output artefacts. Convolutional layers couple the computation of spatial and cross-channel convolutions. However, partial decoupling of spatial and cross-channel convolutions by distributing inputs across multiple convolutional layers and combining outputs can improve performance. Partial decoupling of convolutions is prevalent in many seminal DNN architectures, including FractalNet 1073, Inception 1074-1076, and NASNet 1077.
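The kernel separation described above can be checked with a short NumPy/SciPy sketch: a 5×5 Gaussian kernel is the outer product of two one-dimensional kernels, so two serial one-dimensional convolutions reproduce a single two-dimensional convolution with fewer multiplications per pixel.

import numpy as np
from scipy.ndimage import convolve

# One-dimensional Gaussian kernel (5 taps, 1 px standard deviation).
t = np.arange(-2, 3)
g1d = np.exp(-t**2 / 2.0)
g1d /= g1d.sum()

# The separable 5x5 kernel is the outer product of the 1D kernel with itself.
g2d = np.outer(g1d, g1d)

rng = np.random.default_rng(0)
image = rng.random((96, 96))

# One 2D convolution (25 multiplications per output pixel) ...
smoothed_2d = convolve(image, g2d, mode="reflect")

# ... equals two serial 1D convolutions (5 + 5 multiplications per pixel).
smoothed_sep = convolve(
    convolve(image, g1d[np.newaxis, :], mode="reflect"),
    g1d[:, np.newaxis], mode="reflect")

print(np.allclose(smoothed_2d, smoothed_sep))  # True (up to rounding).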
Taking decoupling to an extreme, depthwise separable convolutions 539, 1094, 1095 shown in figure 8 compute depthwise convolutions, x_d^depthwise = u_d ∗ x_d^input, for each of D input channels, then compute pointwise 1×1 convolutions, x_k^output = b_k + ∑_{d=1}^{D} v_{k,d} x_d^depthwise, for D intermediate channels, where K output channels are indexed by k ∈ [1, K]. Depthwise convolution kernels have weights, u, and the depthwise layer is often followed by extra batch normalization before pointwise convolution to improve performance and accelerate convergence 1094. Increasing numbers of channels with pointwise convolutions can increase accuracy 1094, at the cost of increased computation. Pointwise convolutions are a special case of traditional convolutional layers in equation 40 and have convolution kernel weights, v, and add biases, b. Naively, depthwise separable convolutions require fewer weight multiplications than traditional convolutions 1096, 1097. However, extra batch normalization and serialization of one convolutional layer into depthwise and pointwise convolutional layers mean that depthwise separable convolutions and traditional convolutions have similar computing times 539, 1097. Most DNNs developed for computer vision use fixed-size inputs. Although fixed input sizes are often regarded as an artificial constraint, the constraint is similar to animalian vision, where there is an effectively constant number of retinal rods and cones 1098-1100. Typically, the most practical approach to handle arbitrary image shapes is to train a DNN with crops so that it can be tiled across images. In some cases, a combination of cropping, padding and interpolative resizing can also be used. To fully utilize unmodified variable size inputs, a simple approach is to train convolutional layers on variable size inputs. A pooling layer, such as global average pooling, can then be applied to fix output size before fully connected or other layers that might require fixed-size inputs. More involved approaches include spatial pyramid pooling 1101 or scale RNNs 1102. Typical electron micrographs are much larger than 300×300, which often makes it unfeasible for electron microscopists with a few GPUs to train high-performance DNNs on full-size images. For comparison, Xception was trained on 300×300 images with 60 K80 GPUs for over one month. The Fourier transform 1103, f̂(k_1, ..., k_N), at an N-dimensional Fourier space vector, {k_1, ..., k_N}, is related to a function, f(x_1, ..., x_N), by f̂(k_1, ..., k_N) = (|b| / (2π)^{1−a})^{N/2} ∫ f(x_1, ..., x_N) exp(i b (k_1 x_1 + ... + k_N x_N)) dx_1 ... dx_N, where π = 3.141..., and i = (−1)^{1/2} is the imaginary number. Two parameters, a and b, can parameterize popular conventions that relate the Fourier and inverse Fourier transforms. Mathematica documentation nominates conventions 1104 for general applications (a, b), pure mathematics (1, −1), classical physics (−1, 1), modern physics (0, 1), systems engineering (1, −1), and signal processing (0, 2π). We observe that most electron microscopists follow the modern physics convention of a = 0 and b = 1; however, the choice of convention is arbitrary and does not matter if it is consistent within a project. For discrete functions, Fourier integrals are replaced with summations that are limited to the support of a function. Discrete Fourier transforms of uniformly spaced inputs are often computed with a fast Fourier transform (FFT) algorithm, which can be parallelized for CPUs 1105 or GPUs 65, 1106-1108. Typically, the speedup of FFTs on GPUs over CPUs is higher for larger signals 1109, 1110. Most popular FFTs are based on the Cooley-Tukey algorithm 1111, 1112, which recursively divides FFTs into smaller FFTs.
We observe that some electron microscopists consider FFTs to be limited to radix-2 signals that can be recursively halved; however, FFTs can use any combination of factors for the sizes of recursively smaller FFTs. For example, clFFT 1113 FFT algorithms support signal sizes that are any product of powers of 2, 3, 5, 7, 11 and 13. Convolution theorems can decrease computation by enabling convolution in the Fourier domain 1114. To ease notation, we denote the Fourier transform of a signal, I, by FT(I), and the inverse Fourier transform by FT−1(I). Following, the convolution theorems for two signals, I_1 and I_2, are 1115 FT(I_1 ∗ I_2) = FT(I_1) · FT(I_2) and FT(I_1 · I_2) = FT(I_1) ∗ FT(I_2), up to constants that depend on the Fourier convention, where the signals can be feature channels and convolutional kernels. Fourier domain convolutions, I_1 ∗ I_2 = FT−1(FT(I_1) · FT(I_2)), are increasingly efficient, relative to signal domain convolutions, as kernel and image sizes increase 1114. Indeed, Fourier domain convolutions are exploited to enable faster training with large kernels in Fourier CNNs 1114, 1116. However, Fourier CNNs are rare as most researchers use small 3×3 kernels, following University of Oxford Visual Geometry Group (VGG) CNNs 1117.

Figure 10. Residual blocks where a) one, b) two, and c) three convolutional layers are skipped. Typically, convolutional layers are followed by batch normalization then activation.

Residual connections 1080 add a signal after skipping ANN layers, similar to cortical skip connections 1118, 1119. Residuals improve DNN performance by preserving gradient norms during backpropagation 537, 1120 and avoiding bad local minima 1121 by smoothing DNN loss landscapes 1122. In practice, residuals enable DNNs to behave like an ensemble of shallow networks 1123 that learn to iteratively estimate outputs 1124. Mathematically, a residual layer learns parameters, w_l, of a perturbative function, f_l(x_l, w_l), that maps a signal, x_l, at depth l to depth l + 1, x_{l+1} = x_l + f_l(x_l, w_l). Residuals were developed for CNNs 1080, and examples of residual connections that skip one, two and three convolutional layers are shown in figure 10. Nonetheless, residuals are also used in MLPs 1125 and RNNs 1126-1128. Representational capacity of perturbative functions increases as the number of skipped layers increases. As a result, most residuals skip two or three layers. Skipping one layer rarely improves performance due to its low representational capacity 1080. There are a range of residual connection variants that can improve performance. For example, highway networks 1129, 1130 apply a gating function to skip connections, and dense networks 1131-1133 use a high number of residual connections from multiple layers. Another example is applying a 1×1 convolutional layer to x_l before addition 539, 1080 where f_l(x_l, w_l) spatially resizes or changes numbers of feature channels. However, resizing with norm-preserving convolutional layers 1120 before residual blocks can often improve performance. Finally, long additive 1134 residuals that connect DNN inputs to outputs are often applied to DNNs that learn perturbative functions. A limitation of preserving signal information with residuals 1135, 1136 is that residuals make DNNs learn perturbative functions, which can limit accuracy of DNNs that learn non-perturbative functions if they do not have many layers. Feature channel concatenation is an alternative approach that is not perturbative, and that supports combination of layers with different numbers of feature channels.
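The following NumPy sketch contrasts the two kinds of skip connection just described, with a toy stand-in for the skipped layers: a residual connection adds the skipped signal to a perturbation, so input and output shapes must match, whereas a concatenation skip stacks feature channels, so layer widths may differ.

import numpy as np

rng = np.random.default_rng(0)

def f(x, w):
    """Toy stand-in for skipped layers, e.g. convolution, batch norm, ReLU."""
    return np.maximum(0.0, w * x)

# Feature maps with shape (height, width, channels).
x = rng.normal(size=(8, 8, 16))
w = 0.1 * rng.normal(size=16)

# Residual connection: the output is a perturbation of the input,
# x_{l+1} = x_l + f(x_l, w_l), so input and output shapes must match.
residual_out = x + f(x, w)

# Concatenation skip: channels are stacked instead of added, so the
# combined layer can have a different number of channels.
concat_out = np.concatenate([x, f(x, w)], axis=-1)

print(residual_out.shape)  # (8, 8, 16)
print(concat_out.shape)    # (8, 8, 32)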
In encoder-decoders, a typical example is concatenating features computed near the start with layers near the end to help resolve output features 305, 306, 308, 316. Concatenation can also combine embeddings of different 1137, 1138 or variants of 366 input features by multiple DNNs. Finally, peephole connections in RNNs can improve performance by using concatenation to combine cell state information with other cell inputs 1139, 1140. There is a wide variety of ANN architectures 4-7 that are trained to minimize losses for a range of applications. Many of the most popular ANNs are also the simplest, and information about them is readily available. For example, encoder-decoder 305-308, 502-504 or classifier 272 ANNs usually consist of single feedforward sequences of layers that map inputs to outputs. This section introduces more advanced ANNs used in electron microscopy, including actor-critics, GANs, RNNs, and variational autoencoders (VAEs). These ANNs share weights between layers or consist of multiple subnetworks. Other notable architectures include recursive CNNs 1078, 1079, Network-in-Networks 1141 (NiNs), and transformers 1142, 1143. Although they will not be detailed in this review, their references may be good starting points for research. Most ANNs are trained by gradient descent using backpropagated gradients of a differentiable loss function cf. section 6.1. However, some losses are not differentiable. Examples include losses of actors directing their vision 1144, 1145, and playing competitive 24 or score-based 1146, 1147 computer games. To overcome this limitation, a critic 1148 can be trained to predict differentiable losses from action and state information, as shown in figure 11.

Figure 11. Actor-critic architecture. An actor outputs actions based on input states. A critic then evaluates action-state pairs to predict losses.

If the critic does not depend on states, it is a surrogate loss function 1149, 1150. Surrogates are often fully trained before actor optimization, whereas critics that depend on actor-state pairs are often trained alongside actors to minimize the impact of catastrophic forgetting 1151 by adapting to changing actor policies and experiences. Alternatively, critics can be trained with features output by intermediate layers of actors to generate synthetic gradients for backpropagation 1152.

Figure 12. Generative adversarial network architecture. A generator learns to produce outputs that look realistic to a discriminator, which learns to predict whether examples are real or generated.

Generative adversarial networks 197-200 (GANs) consist of generator and discriminator subnetworks that play an adversarial game, as shown in figure 12. Generators learn to generate outputs that look realistic to discriminators, whereas discriminators learn to predict whether examples are real or generated. Most GANs are developed to generate visual media with realistic characteristics. For example, partial STEM images infilled with a GAN are less blurry than images infilled with a non-adversarial generator trained to minimize MSEs 201 cf. figure 2. Alternatively, computationally inexpensive loss functions designed by humans, such as structural similarity index measures 1153 (SSIMs) and Sobel losses 231, can improve generated output realism. However, it follows from the universal approximator theorems 37-45 that training with ANN discriminators can often yield more realistic outputs.
There are many popular GAN loss functions and regularization mechanisms [1154] [1155] [1156] [1157] [1158] . Traditionally, GANs were trained to minimize logarithmic discriminator, D, and generator, G, losses 1159 , where z are generator inputs, G(z) are generated outputs, and x are example outputs. Discriminators predict labels, D(x) and D(G(z)), where target labels are 0 and 1 for generated and real examples, respectively. Limitedly, logarithmic losses are numerically unstable for D(x) → 0 or D(G(z)) → 1, as the denominator, vanishes. In addition, discriminators must be limited to D(x) > 0 and D(G(z)) < 1, so that logarithms are not complex. To avoid these issues, we recommend training discriminators with squared difference losses 1160, 1161 , However, there are a variety of other alternatives to logarithmic loss functions that are also effective 1154, 1155 . A variety of methods have been developed to improve GAN training 995, 1162 . The most common issues are catastrophic forgetting 1151 iteration. Alternatively, Lipschitz continuity can be imposed by adding a gradient penalty 1165 to GAN losses, such as differences of L2 norms of discriminator gradients from unity, where ε ∈ [0, 1] is a uniform random variate, λ weights the gradient penalty, andx is an attempt to generate x. However, using a gradient penalty introduces additional gradient backpropagation that increases discriminator training time. There are also a variety of computationally inexpensive tricks that can improve training, such as adding noise to labels 995, 1075, 1166 or balancing discriminator and generator learning rates 349 . These tricks can help to avoid discontinuities in discriminator output distributions that can lead to mode collapse; however, we observe that these tricks do not reliably stabilize GAN training. Instead, we observe that spectral normalization 1017 reliably stabilizes GAN discriminator training in our electron microscopy research 201, 202, 349 . Spectral normalization controls Lipschitz constants of discriminators by fixing the spectral norms of their weights, as introduced in section 4.2. Advantages of spectral normalization include implementations based on the power iteration method 1023, 1024 being computationally inexpensive, not adding a regularizing loss function that could detrimentally compete 1167, 1168 with discrimination losses, and being effective with one discriminator training iterations per generator training iteration 1017, 1169 . Spectral normalization is popular in GANs for high-resolution image synthesis, where it is also applied in generators to stabilize training 1170 . There are a variety of GAN architectures 1171 . For high-resolution image synthesis, computation can be decreased by training multiple discriminators to examine image patches at different scales 201, 1172 . For domain translation characterized by textural differences, a cyclic GAN 1004, 1173 consisting of two GANs can map from one domain to the other and vice versa. Alternatively, two GANs can share intermediate layers to translate inputs via a shared embedding domain 1174 . Cyclic GANs can also be combined with a siamese network [279] [280] [281] for domain translation beyond textural differences 1175 . Finally, discriminators can introduce auxiliary losses to train DNNs to generalize to examples from unseen domains 1176-1178 . Recurrent neural networks 531-536 reuse an ANN cell to process each step of a sequence. 
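Before turning to recurrent networks in detail, a minimal NumPy sketch of the spectral normalization recommended above for GAN discriminators: the power iteration method estimates the spectral norm of a weight matrix, and dividing by it bounds the layer's Lipschitz constant. Persisting u between training iterations is what makes a single power iteration per step sufficient; the matrix sizes here are arbitrary.

import numpy as np

def spectral_normalize(w, u, n_iterations=1, eps=1e-12):
    """Approximate spectral normalization of a 2D weight matrix.

    Power iteration estimates the largest singular value, sigma(w), and the
    weights are divided by it so their spectral norm is approximately one.
    The vector u should be stored and reused between training iterations."""
    for _ in range(n_iterations):
        v = w.T @ u
        v /= (np.linalg.norm(v) + eps)
        u = w @ v
        u /= (np.linalg.norm(u) + eps)
    sigma = u @ w @ v  # Estimated spectral norm.
    return w / sigma, u

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))
u = rng.normal(size=64)

w_sn, u = spectral_normalize(w, u, n_iterations=20)
print(np.linalg.norm(w_sn, ord=2))            # Close to 1.
print(np.linalg.svd(w, compute_uv=False)[0])  # Exact sigma for comparison.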
Most RNNs learn to model longterm dependencies by gradient backpropagation through time 1179 (BPTT). The ability of RNNs to utilize past experiences enables them to model partially observed and variable length Markov decision processes 1180, 1181 (MDPs). Applications of RNNs include directing vision 1144, 1145 , image captioning 1182, 1183 , language translation 1184 , medicine 77 , natural language processing 1185, 1186 , playing computer games 24 , text classification 1055 , and traffic forecasting 1187 . Many RNNs are combined with CNNs to embed visual media 1145 or words 1188, 1189 , or to process RNN outputs 1190, 1191 . RNNs can also be combined with MLPs 1144 , or text embeddings 1192 such as BERT 1192, 1193 , continuous bag-of-words [1194] [1195] [1196] (CBOW), doc2vec 1197, 1198 , GloVe 1199 , and word2vec 1194, 1200 . The most popular RNNs consist of long short-term memory 1201-1204 (LSTM) cells or gated recurrent units 1202, [1205] [1206] [1207] (GRUs). LSTMs and GRUs are popular as they solve the vanishing gradient problem 537, 1208, 1209 and have consistently high performance [1210] [1211] [1212] [1213] [1214] [1215] . Their architectures are shown in figure 13 . At step t, an LSTM outputs a hidden state, h t , and cell state, C t , given by where C t−1 is the previous cell state, h t−1 is the previous hidden state, x t is the step input, and σ is a logistic sigmoid function of equation 10a, where (w z , b z ), (w r , b r ), and (w h , b h ) are pairs of weights and biases. Minimal gated units (MGUs) can further reduce computation 1216 . A large-scale analysis of RNN architectures for language translation found that LSTMs consistently outperform GRUs 1210 . GRUs struggle with simple languages that are learnable by LSTMs as the combined hidden and cell states of GRUs make it more difficult for GRUs to perform unbounded counting 1214 1139, 1140 and projection layers 1228 , does not consistently improve performance. For electron microscopy, we recommend defaulting to LSTMs as we observe that their performance is more consistently high than performance of other RNNs. However, LSTM and GRU performance is often comparable, so GRUs are also a good choice to reduce computation. There are a variety of architectures based on RNNs. Popular examples include deep RNNs 1229 that stack RNN cells to increase representational ability, bidirectional RNNs 1230-1233 that process sequences both forwards and in reverse to improve input utilization, and using separate encoder and decoder subnetworks 1205, 1234 to embed inputs and generate outputs. Hierarchical RNNs [1235] [1236] [1237] [1238] [1239] are more complex models that stack RNNs to efficiently exploit hierarchical sequence information, 25 and include multiple timescale RNNs 1240, 1241 (MTRNNs) that operate at multiple sequence length scales. Finally, RNNs can be augmented with additional functionality to enable new capabilities. For example, attention 1182, 1242-1244 mechanisms can enable more efficient input utilization. Further, creating a neural Turing machine (NTMs) by augmenting a RNN with dynamic external memory 1245, 1246 can make it easier for an agent to solve dynamic graphs. Architectures of autoencoders where an encoder maps an input to a latent space and a decoder learns to reconstruct the input from the latent space. 
a) An autoencoder encodes an input in a deterministic latent space, whereas a b) traditional variational autoencoder encodes an input as means, µ, and standard deviations, σ , of Gaussian multivariates, µ + σ · ε, where ε is a standard normal multivariate. In general, semantics of AE outputs are pathological functions of encodings. To generate outputs with well-behaved semantics, traditional VAEs 969, 1260, 1261 learn to encode means, µ, and standard deviations, σ , of Gaussian multivariates. Meanwhile, decoders learn to reconstruct inputs from sampled multivariates, µ + σ · ε, where ε is a standard normal multivariate. Traditional VAE architecture is shown in figure 14b . Usually, VAE encodings are regularized by adding Kullback-Leibler (KL) divergence of encodings from standard multinormals to an AE loss function, where λ KL weights the contribution of the KL divergence loss for a batch size of B, and a latent space with u degrees of freedom. However, variants of Gaussian regularization can improve clustering 231 , and sparse autoencoders [1262] [1263] [1264] [1265] (SAEs) that regularize encoding sparsity can encode more meaningful features. To generate realistic outputs, a VAE can be combined with a GAN to create a VAE-GAN [1266] [1267] [1268] . Adding a loss to minimize differences between gradients of generated and target outputs is computationally inexpensive alternative that can generate realistic outputs for some applications 231 . A popular application of VAEs is data clustering. For example, VAEs can encode hash tables [1269] [1270] [1271] [1272] [1273] for search engines, and we use VAEs as the basis of our electron micrograph search engines 231 . Encoding clusters visualized by tSNE can be labelled to classify data 231 , and encoding deviations from clusters can be used for anomaly detection [1274] [1275] [1276] [1277] [1278] . In addition, learning encodings with well-behaved semantics enables encodings to be used for semantic manipulation 1278, 1279 . Finally, VAEs can be used as generative models to create synthetic populations 1280, 1281 , develop new chemicals [1282] [1283] [1284] [1285] , and synthesize underrepresented data to reduce imbalanced learning 1286 . Training, testing, deployment and maintenance of machine learning systems is often time-consuming and expensive [1287] [1288] [1289] [1290] . The first step is usually preparing training data and setting up data pipelines for ANN training and evaluation. Typically, ANN parameters are randomly initialized for optimization by gradient descent, possibly as part of an automatic machine learning algorithm. Reinforcement learning is a special optimization case where the loss is a discounted future reward. During training, ANN components are often regularized to stabilize training, accelerate convergence, or improve performance. Finally, trained models can be streamlined for efficient deployment. This section introduces each step. We find that electron microscopists can be apprehensive about robustness and interpretability of ANNs, so we also provide subsections on model evaluation and interpretation. Figure 15 . Gradient descent. a) Arrows depict steps across one dimension of a loss landscape as a model is optimized by gradient descent. In this example, the optimizer traverses a small local minimum; however, it then gets trapped in a larger sub-optimal local minimum, rather than reaching the global minimum. b) Experimental DNN loss surface for two random directions in parameter space showing many local minima 1122 . 
The image in part b) is reproduced with permission under an MIT license 1291.

Algorithm 1. Training by gradient descent.
Initialize a model, f(x), with trainable parameters, θ_1.
for training step t = 1, T do
    Forwards propagate a randomly sampled batch of inputs, x, through the model to compute outputs, y = f(x).
    Compute loss, L_t, for outputs.
    Use the differentiation chain rule 1292 to backpropagate gradients of the loss to trainable parameters, θ_{t−1}.
    Apply an optimizer to the gradients to update θ_{t−1} to θ_t.
end for

Most ANNs are iteratively trained by gradient descent 465, 1303-1307, as described by algorithm 1 and shown in figure 15. To minimize computation, results at intermediate stages of forward propagation, where inputs are mapped to outputs, are often stored in memory. Storing the forwards pass in memory enables backpropagation memoization by sequentially computing gradients w.r.t. trainable parameters. To reduce memory costs for large ANNs, a subset of intermediate forwards pass results can be saved as starting points to recompute other stages during backpropagation 1308, 1309. Alternatively, forward pass computations can be split across multiple devices 1310. Optimization by gradient descent plausibly models learning in some biological systems 1311. However, gradient descent is not generally an accurate model of biological learning 1312-1314. There are many popular gradient descent optimizers for deep learning 1303-1305. Update rules for eight popular optimizers are summarized in figure 1. Other optimizers include AdaBound 1315, AMSBound 1315, AMSGrad 1316, Lookahead 1317, NADAM 1318, Nostalgic Adam 1319, Power Gradient Descent 1320, Rectified ADAM 1321 (RADAM), and trainable optimizers 1322-1326. Gradient descent is effective in the high-dimensional optimization spaces of overparameterized ANNs 1327 as the probability of getting trapped in a sub-optimal local minimum decreases as the number of dimensions increases. The simplest optimizer is "vanilla" stochastic gradient descent (SGD), where a trainable parameter perturbation, ∆θ_t = θ_t − θ_{t−1}, is the product of a learning rate, η, and the derivative of a loss, L_t, w.r.t. the trainable parameter, ∂_θ L_t, so that θ_t = θ_{t−1} − η ∂_θ L_t.

Algorithms 1. Update rules of various gradient descent optimizers for a trainable parameter, θ_t, at iteration t, gradients of losses w.r.t. the parameter, ∂_θ L_t, and learning rate, η. Hyperparameters are listed in square brackets; for example, vanilla SGD 1293, 1294 has hyperparameters [η] and quasi-hyperbolic momentum 1299 has hyperparameters [η, β, ν]. Other optimizers listed include Nesterov momentum 1296-1298.

However, vanilla SGD convergence is often limited by unstable parameter oscillations as it is a low-order local optimization method 1328. Further, vanilla SGD has no mechanism to adapt to varying gradient sizes, which vary effective learning rates as ∆θ ∝ ∂_θ L_t. To accelerate convergence, many optimizers introduce a momentum term that weights an average of gradients with past gradients 1296, 1329, 1330. Momentum-based optimizers in figure 1 are momentum, Nesterov momentum 1296, 1297, quasi-hyperbolic momentum 1299, AggMo 1300, ADAM 1302, and AdaMax 1302. To standardize effective learning rates for every layer, adaptive optimizers normalize updates based on an average of past gradient sizes. Adaptive optimizers in figure 1 are RMSProp 1301, ADAM 1302, and AdaMax 1302, which usually result in faster convergence and higher accuracy than other optimizers 1331, 1332.
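As a worked example of the update rules summarized in Algorithms 1, the following NumPy sketch implements the ADAM update for a single parameter tensor and applies it to a toy quadratic loss; the hyperparameter defaults are common choices rather than values from this thesis.

import numpy as np

def adam_update(theta, grad, m, v, t, eta=1e-3,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM step: momentum-like first moment, adaptive second moment.

    m and v are exponential moving averages of gradients and squared
    gradients; bias correction compensates for their zero initialization."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad**2
    m_hat = m / (1.0 - beta1**t)  # Bias-corrected first moment.
    v_hat = v / (1.0 - beta2**t)  # Bias-corrected second moment.
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize L(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([5.0, -3.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    grad = theta
    theta, m, v = adam_update(theta, grad, m, v, t, eta=0.05)
print(theta)  # Approaches the minimum at the origin.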
However, adaptive optimizers can be outperformed by vanilla SGD due to overfitting 1333, so some researchers adapt adaptive learning rates to their variance 1321 or transition from adaptive optimization to vanilla SGD as training progresses 1315. For electron microscopy, we recommend adaptive optimization with Nadam 1318, which combines ADAM with Nesterov momentum, as it is well-established and a comparative analysis of select gradient descent optimizers found that it often achieves higher performance than other popular optimizers 1334. Limitingly, most adaptive optimizers slowly adapt to changing gradient sizes, e.g. a default value for ADAM β_2 is 0.999 1302. To prevent learning being destabilized by spikes in gradient sizes, adaptive optimizers can be combined with adaptive learning rate 261, 1315 or gradient 1208, 1335, 1336 clipping. For non-adaptive optimizers, effective learning rates are likely to vary due to varying magnitudes of gradients w.r.t. trainable parameters. Similarly, learning by biological neurons varies as stimuli usually activate a subset of neurons 1337. However, all neuron outputs are usually computed for ANNs. Thus, not effectively using all weights to inform decisions is computationally inefficient. Further, inefficient weight updates can limit representational capacity, slow convergence, and decrease training stability. A typical example is effective learning rates varying between layers. Following the chain rule, gradients backpropagated to the ith layer of a DNN from its start are ∂L/∂x_i = (∏_{l=i}^{L−1} ∂x_{l+1}/∂x_l) ∂L/∂x_L, for a DNN with L layers. Vanishing gradients 537, 1208, 1209 occur when many layers have ∂x_{l+1}/∂x_l ≪ 1. For example, DNNs with logistic sigmoid activations often exhibit vanishing gradients as their maximum gradient is 1/4 cf. equation 10b. Similarly, exploding gradients 537, 1208, 1209 occur when many layers have ∂x_{l+1}/∂x_l ≫ 1. Adaptive optimizers alleviate vanishing and exploding gradients by dividing gradients by their expected sizes. Nevertheless, it is essential to combine adaptive optimizers with appropriate initialization and architecture to avoid numerical instability. Optimizers have a myriad of hyperparameters to be initialized and varied throughout training to optimize performance 1338 cf. figure 1. For example, stepwise exponentially decayed learning rates are often theoretically optimal 1339. There are also various heuristics that are often effective, such as using a DEMON decay schedule for the ADAM first moment of the momentum decay rate 1340, β_t = β_init (1 − t/T) / ((1 − β_init) + β_init (1 − t/T)), where β_init is the initial value of β_1, t is the iteration number, and T is the final iteration number. Developers often optimize ANN hyperparameters by experimenting with a range of heuristic values. Hyperparameter optimization algorithms 1341-1346 can automate optimizer hyperparameter selection. However, automatic hyperparameter optimizers may not yield sufficient performance improvements relative to well-established heuristics to justify their use, especially in initial stages of development. Alternatives to gradient descent 1347 are rarely used for parameter optimization as they are not known to consistently improve upon gradient descent. For example, simulated annealing 1348, 1349 has been applied to CNN training 1350, 1351, and can be augmented with momentum to accelerate convergence in deep learning 1352. Simulated annealing can also augment gradient descent to improve performance 1353.
Other approaches include evolutionary 1354, 1355 and genetic 1356, 1357 algorithms, which can be a competitive alternative to deep reinforcement learning where convergence is slow 1358 . Indeed, recent genetic algorithms have outperformed a popular deep reinforcement learning algorithm 1359 . Another direction is to augment genetic algorithms with ANNs to accelerate convergence [1360] [1361] [1362] [1363] . Other alternatives to backpropagation include direct search 1364 , the Moore-Penrose Pseudo Inverse 1365 ; particle swarm optimization [1366] [1367] [1368] [1369] (PSO); and echo-state networks [1370] [1371] [1372] (ESNs) and extreme learning machines [1373] [1374] [1375] [1376] [1377] [1378] [1379] (ELMs), where some randomly initialized weights are never updated. Reinforcement learning [1380] [1381] [1382] [1383] [1384] [1385] [1386] (RL) is where a machine learning system, or "actor", is trained to perform a sequence of actions. Applications include autonomous driving [1387] [1388] [1389] , communications network control 1390, 1391 , energy and environmental management 1392, 1393 , playing games 24-29, 1146, 1394 , and robotic manipulation 1395, 1396 . To optimize a MDP 1180, 1181 , a discounted future reward, Q t , at step t in a MDP with T steps is usually calculated from step rewards, r t , with Bellman's equation, where γ ∈ [0, 1) discounts future step rewards. To be clear, multiplying Q t by −1 yields a loss that can be minimized using the methods in section 6.1. In practice, many MDPs are partially observed or have non-differentiable losses that may make it difficult to learn a good policy from individual observations. However, RNNs can often learn a model of their environments from sequences of observations 1147 . Alternatively, FNNs can be trained with groups of observations that contain more information than individual observations 1146, 1394 . If losses are not differentiable, a critic can learn to predict differentiable losses for actor training cf. section 5.1. Alternatively, actions can be sampled from a differentiable probability distribution 1144, 1397 as training losses given by products of losses and sampling probabilities are differentiable. There are also a variety of alternatives to gradient descent introduced at the end of section 6.1 that do not require differentiable loss functions. There are a variety of exploration strategies for RL 1398, 1399 . Adding Ornstein-Uhlenbeck 1400 (OU) noise to actions is effective for continuous control tasks optimized by deep deterministic policy gradients 1146 (DDPG) or recurrent deterministic policy gradients 1147 (RDPG) RL algorithms. Adding Gaussian noise achieves similar performance for optimization by TD3 1401 or D4PG 1402 RL algorithms. However, a comparison of OU and Gaussian noise across a variety of tasks 1403 found that OU noise usually achieves similar performance to or outperforms Gaussian noise. Similarly, exploration can be induced by adding noise to ANN parameters 1404, 1405 . Other approaches to exploration include rewarding actors for increasing action entropy [1405] [1406] [1407] and intrinsic motivation [1408] [1409] [1410] , where ANNs are incentified to explore actions that they are unsure about. RL algorithms are often partitioned into online learning 1411, 1412 , where training data is used as it is acquired; and offline learning 1413, 1414 , where a static training dataset has already been acquired. 
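To make the discounted future reward, Q_t, defined above concrete, the following NumPy sketch computes it for every step of an episode with the reverse recursion Q_t = r_t + γ Q_{t+1}; the rewards are arbitrary illustrative values.

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Discounted future rewards, Q_t = r_t + gamma * Q_{t+1}, computed by
    iterating backwards over an episode of step rewards."""
    q = np.zeros_like(rewards, dtype=float)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        q[t] = future
    return q

# Example episode: sparse rewards with a final success reward.
rewards = np.array([0.0, 0.0, 1.0, 0.0, 10.0])
print(discounted_returns(rewards, gamma=0.9))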
However, many algorithms operate in an intermediate regime, where data collected with an online policy is stored in an experience replay 1415-1417 buffer for offline learning. Training data is often sampled at random from a replay. However, prioritizing the replay of data with high losses 993 or data that results in high policy improvements 992 often improves actor performance. A default replay buffer size of around 10 6 examples is often used; however, training is sensitive to replay buffer size 1418 . If the replay is too small, changes in actor policy may destabilize training; whereas if the replay is too large, convergence may be slowed by delays before the actor learns from policy changes. There are a variety of automatic machine learning [1419] [1420] [1421] [1422] [1423] (AutoML) algorithms that can create and optimize ANN architectures and learning policies for a dataset of input and target output pairs. Most AutoML algorithms are based on RL or evolutionary algorithms. Examples of AutoML algorithms include AdaNet 1424 1436 , and others [1437] [1438] [1439] [1440] [1441] . AutoML is becoming increasingly popular as it can achieve higher performance than human developers 1077, 1442 and enables human developer time to be traded for potentially cheaper computer time. Nevertheless, AutoML is currently limited to established ANN architectures and learning policies. Following, we recommend that researchers either focus on novel ANN architectures and learning policies or developing ANNs for novel applications. How ANN trainable parameters are initialized 537, 1443 is related to model capacity 1444 . Further, initializing parameters with values that are too small or large can cause slow learning or divergence 537 . Careful initialization can also prevent training by gradient descent being destabilized by vanishing or exploding gradients 537, 1208, 1209 , or high variance of length scales across layers 537 . Finally, careful initialization can enable momentum to accelerate convergence and improve performance 1296 . Most trainable parameters are multiplicative weights or additive biases. Initializing parameters with constant values would result in every parameter in a layer receiving the same updates by gradient descent, reducing model capacity. Thus, weights are often randomly initialized. Following, biases are often initialized with constant values due to symmetry breaking by the weights. Consider the projection of n in inputs, , ..., x output n out }, by an n in × n out weight matrix, w. The expected variance of an output element is 1443 where E(x) and Var(x) denote the expected mean and variance of elements of x, respectively. For similar length scales across layers, Var(x output ) should be constant. Initially, similar variances can be achieved by normalizing ANN inputs to have zero mean, so that E(x input ) = 0, and initializing weights so that E(w) = 0 and Var(w) = 1/n in . However, parameters can shift during training, destabilizing learning. To compensate for parameter shift, popular normalization layers like batch normalization often impose E(x input ) = 0 and Var(x input ) = 1, relaxing need for E(x input ) = 0 or E(w) = 0. Nevertheless, training will still be sensitive to the length scale of trainable parameters. There are a variety of popular weight initializers that adapt weights to ANN architecture. 
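A quick NumPy check of the variance argument above: with zero-mean, unit-variance inputs and weights drawn with E(w) = 0 and Var(w) = 1/n_in, projected outputs keep a similar length scale, whereas unit-variance weights inflate it by a factor of about n_in. The layer sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 512, 10000

# Zero-mean inputs with unit variance.
x = rng.normal(size=(batch, n_in))

# Weights initialized with E(w) = 0 and Var(w) = 1 / n_in.
w = rng.normal(scale=np.sqrt(1.0 / n_in), size=(n_in, n_out))
print(np.var(x @ w))  # Close to 1: the length scale is preserved.

# Weights initialized too large destabilize length scales across layers.
w_big = rng.normal(scale=1.0, size=(n_in, n_out))
print(np.var(x @ w_big))  # Close to n_in = 512: activations blow up.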
One of the oldest methods is LeCun initialization 941, 951 , where weights are initialized with variance, which is argued to produce outputs with similar length scales in the previous paragraph. However, a similar argument can be made for initializing with Var(w) = 1/n out to produce similar gradients at each layer during the backwards pass 1443 . As a compromise, Xavier initialization 1445 computes an average, However, adjusting weights for n out is not necessary for adaptive optimizers like ADAM, which divide gradients by their length scales, unless gradients will vanish or explode. Finally, He initialization 22 doubles the variance of weights to and is often used in ReLU networks to compensate for activation functions halving variances of their outputs 22, 1443, 1446 . Most trainable parameters are initialized from either a zero-centred Gaussian or uniform distribution. For convenience, the limits of such a uniform distribution are ±(3Var(w)) 1/2 . Uniform initialization can outperform Gaussian initialization in DNNs due to Gaussian outliers harming learning 1443 . However, issues can be avoided by truncating Gaussian initialization, often to two standard deviations, and rescaling to its original variance. Some initializers are mainly used for RNNs. For example, orthogonal initialization 1447 often improves RNN training 1448 by reducing susceptibility to vanishing and exploding gradients. Similarly, identity initialization 1449, 1450 can help RNNs to learn long-term dependencies. In most ANNs, biases are initialized with zeros. However, the forget gates of LSTMs are often initialized with ones to decrease forgetting at the start of training 1211 . Finally, the start states of most RNNs are initialized with zeros or other constants. However, random multivariate or trainable variable start states can improve performance 1451 . There are a variety of alternatives to initialization from random multivariates. Weight normalized 1014 ANNs are a popular example of data-dependent initialization, where randomly initialized weight magnitudes and biases are chosen to counteract variances and means of an initial batch of data. Similarly, layer-sequential unit-variance (LSUV) initialization 1452 consists of orthogonal initialization followed by adjusting the magnitudes of weights to counteract variances of an initial batch of data. Other approaches standardize the norms of backpropagated gradients. For example, random walk initialization 1453 (RWI) finds scales for weights to prevent vanishing or exploding gradients in deep FNNs, albeit with varied success 1452 . Alternatively, MetaInit 1454 scales the magnitudes of randomly initialized weights to minimize changes in backpropagated gradients per iteration of gradient descent. There are a variety of regularization mechanisms 1455-1458 that modify learning algorithms to improve ANN performance. One of the most popular is LX regularization, which decays weights by adding a loss, weighted by λ X to each trainable variable, θ i . L2 regularization 1459-1461 is preferred 1462 for most DNN optimization as subtraction of its gradient, ∂ θ i L 2 = λ 2 θ i , is equivalent to computationally-efficient multiplicative weight decay. Nevertheless, L1 regularization is better at inducing model sparsity 1463 than L2 regularization, and L1 regularization achieves higher performance in some applications 1464 . Higher performance can also be achieved by adding both L1 and L2 regularization in elastic nets 1465 . 
LX regularization is most effective at the start of training and becomes less important near convergence 1459 . Finally, L1 and L2 regularization are closely related to lasso 1466 and ridge 1467 regularization, respectively, whereby trainable parameters are adjusted to limit L 1 and L 2 losses. Gradient clipping 1336, [1468] [1469] [1470] accelerates learning by limiting large gradients, and is most commonly applied to RNNs. A simple approach is to clip gradient magnitudes to a threshold hyperparameter. However, it is more common to scale gradients, g i , at layer i if their norm is above a threshold, u, so that 1208, 1469 where n = 2 is often chosen to minimize computation. Similarly, gradients can be clipped if they are above a global norm, computed with gradients at L layers. Scaling gradient norms is often preferable to clipping to a threshold as scaling is akin to adapting layer learning rates and does not affect the directions of gradients. Thresholds for gradient clipping are often set based on average norms of backpropagated gradients during preliminary training 1471 . However, thresholds can also be set automatically and adaptively 1335, 1336 . In addition, adaptive gradient clipping algorithms can skip training iterations if gradient norms are anomalously high 1472 , which often indicates an imminent gradient explosion. Dropout 1473-1477 often reduces overfitting by only using a fraction, p i , of layer i outputs during training, and multiplying all outputs by p i for inference. However, dropout often increases training time, can be sensitive to p i , and sometimes lowers performance 1478 . Improvements to dropout at the structural level, such as applying it to convolutional channels, paths, and layers, rather than random output elements, can improve performance 1479 . For example, DropBlock 1480 improves performance by dropping contiguous regions of feature maps to prevent dropout being trivially circumvented by using spatially correlated neighbouring outputs. Similarly, PatchUp 1481 swaps or mixes contiguous regions with regions for another sample. Dropout is often outperformed by Shakeout 1482, 1483 , a modification of dropout that randomly enhances or reverses contributions of outputs to the next layer. Noise often enhances ANN training by decreasing susceptibility to spurious local minima 1484 . Adding noise to trainable parameters can improve generalization 1485, 1486 , or exploration for RL 1404 . Parameter noise is usually additive as it does not change an objective function being learned, whereas multiplicative noise can change the objective 1487 . In addition, noise can be added to inputs 1253, 1488 , hidden layers 1158, 1489 , generated outputs 1490 or target outputs 995, 1491 . However, adding noise to signals does not always improve performance 1217 . Finally, modifying usual gradient noise 1492 by adding noise to gradients can improve performance 1493 . Typically, additive noise is annealed throughout training, so that that final training is with a noiseless model that will be used for inference. There are a variety of regularization mechanisms that exploit extra training data. A simple approach is to create extra training examples by data augmentation [1494] [1495] [1496] . Extra training data can also be curated, or simulated for training by domain adaption [1176] [1177] [1178] . 
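The gradient norm scaling described above can be sketched in a few lines of NumPy; the threshold and gradient values are arbitrary, and n = 2 is used for the norm.

import numpy as np

def clip_gradient_norm(grad, threshold, norm_order=2):
    """Scale a layer's gradients if their norm exceeds a threshold, u, so
    that g <- u * g / ||g||_n; scaling preserves the gradient direction."""
    norm = np.linalg.norm(grad.ravel(), ord=norm_order)
    if norm > threshold:
        grad = threshold * grad / norm
    return grad

rng = np.random.default_rng(0)
grad = rng.normal(scale=100.0, size=(3, 3))     # Anomalously large gradients.
clipped = clip_gradient_norm(grad, threshold=1.0)
print(np.linalg.norm(clipped))                  # 1.0
print(np.allclose(clipped / np.linalg.norm(clipped),
                  grad / np.linalg.norm(grad)))  # Direction unchanged: True.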
Alternatively, semi-supervised learning [1497] [1498] [1499] [1500] [1501] [1502] can generate target outputs for a dataset of unpaired inputs to augment training with a dataset of paired inputs and target outputs. Finally, multitask learning [1503] [1504] [1505] [1506] [1507] can improve performance by introducing additional loss functions. For instance, an auxiliary classifier can be added to predict image labels from features generated by intermediate DNN layers [1508] [1509] [1510] [1511] . Losses are often manually balanced; however, their gradients can also be balanced automatically and adaptively 1167, 1168 . A data pipeline prepares data to be input to an ANN. Efficient pipelines often parallelize data preparation across multiple CPU cores 1512 . Small datasets can be stored in RAM to decrease data access times, whereas large dataset elements are often loaded from files. Loaded data can then be preprocessed and augmented 1494, 1495, [1513] [1514] [1515] . For electron micrographs, preprocessing often includes replacing non-finite elements, such as NaN and inf, with finite values; linearly transforming intensities to a common range, such as [−1, 1] or zero mean and unit variance; and performing a random combination of flips and 90° rotations to augment data by a factor of eight 70, 201, 202, 231, 349 . Preprocessed examples can then be combined into batches. Typically, multiple batches that are ready to be input are prefetched and stored in RAM to avoid delays due to fluctuating CPU performance. To efficiently utilize data, training datasets are often reiterated over for multiple training epochs. Usually, training datasets are reiterated over about 10^2 times. Increasing epochs can maximize utilization of potentially expensive training data; however, increasing epochs can lower performance due to overfitting 1516, 1517 or be too computationally expensive 539 . Naively, batches of data can be randomly sampled with replacement during training by gradient descent. However, convergence can be accelerated by reinitializing a training dataset at the start of each training epoch and randomly sampling data without replacement [1518] [1519] [1520] [1521] [1522] . Most modern DLFs, such as TensorFlow, provide efficient and easy-to-use functions to control data sampling 1523 . There are a variety of methods for ANN performance evaluation 538 . However, most ANNs are evaluated by 1-fold validation, where a dataset is partitioned into training, validation, and test sets. After ANN optimization with a training set, ability to generalize is measured with a validation set. Multiple validations may be performed for training with early stopping 1516, 1517 or ANN learning policy and architecture selection, so final performance is often measured with a test set to avoid overfitting to the validation set. Most researchers favour using single training, validation, and test sets to simplify standardization of performance benchmarks 231 . However, multiple-fold validation 538 or multiple validation sets 1524 can improve performance characterization. Alternatively, models can be bootstrap aggregated 1525 (bagged) from multiple models trained on different subsets of training data. Bagging is usually applied to random forests [1526] [1527] [1528] or other lightweight models, and enables model uncertainty to be gauged from the variance of model outputs. For small datasets, model performance is often sensitive to the split of data between training and validation sets 1529 .
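As a concrete example of preparing default partitions for 1-fold validation, the sketch below splits example indices into training, validation, and test subsets; the split fractions, shuffling, and seed are illustrative choices, not the partitioning actually used for our datasets, which partitions contributions before any shuffling.

```python
import numpy as np

def split_dataset(n_examples, fractions=(0.75, 0.15, 0.10), seed=0):
    """Partition example indices into training, validation, and test sets.
    Shuffling before splitting is shown for illustration only; partitioning
    before shuffling can keep different contributors in different subsets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_examples)
    n_train = int(fractions[0] * n_examples)
    n_val = int(fractions[1] * n_examples)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

train_idx, val_idx, test_idx = split_dataset(19769)
print(len(train_idx), len(val_idx), len(test_idx))
```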
Increasing training set size usually increases model accuracy, whereas increasing validation set size decreases performance uncertainty. Indeed, a scaling law can be used to estimate an optimal tradeoff 1530 between training and validation set sizes. However, most experimenters follow a Pareto 1531 splitting heuristic. For example, we often use a 75:15:10 training-validation-test split 231 . Heuristic splitting is justified for ANN training with large datasets insofar as sensitivity to splitting ratios decreases with increasing dataset size 2 . If an ANN is deployed 1532-1534 on multiple different devices, such as various electron microscopes, a separate model can be trained for each device 403 . Alternatively, a single model can be trained and specialized for different devices to decrease training requirements 1535 . In addition, ANNs can remotely service requests from cloud containers [1536] [1537] [1538] . Integration of multiple ANNs can be complicated by different servers for different DLFs supporting different backends; however, unified interfaces are available. For example, GraphPipe 1539 provides simple, efficient reference model servers for Tensorflow, Caffe2, and ONNX; a minimalist machine learning transport specification based on FlatBuffers 1540 ; and efficient client implementations in Go, Python, and Java. In 2020, most ANNs developed by researchers were not deployed. However, we anticipate that deployment will become a more prominent consideration as the role of deep learning in electron microscopy matures. Most ANNs are optimized for inference by minimizing parameters and operations from training time onwards, like MobileNets 1094 . However, less essential operations can also be pruned after training 1541, 1542 . Another approach is quantization, where ANN bit depths are decreased, often to efficient integer instructions, to increase inference throughput 1543, 1544 . Quantization often decreases performance; however, the amount of quantization can be adapted to ANN components to optimize performance-throughput tradeoffs 1545 . Alternatively, training can be modified to minimize the impact of quantization on performance [1546] [1547] [1548] . Another approach is to specialize bit manipulation for deep learning. For example, signed brain floating point (bfloat16) often improves accuracy on TPUs by using an 8 bit exponent and 7 bit mantissa, rather than a usual 5 bit exponent and 10 bit mantissa 1549 . Finally, ANNs can be adaptively selected from a set of ANNs based on available resources to balance the tradeoff between performance and inference time 1550 , similar to image optimization for web applications 1551, 1552 . Figure 16. Inputs that maximally activate channels in GoogLeNet 1076 after training on ImageNet 71 . Neurons in layers near the start have small receptive fields and discern local features. Middle layers discern semantics recognisable by humans, such as dogs and wheels. Finally, layers at the end of the DNN, near its logits, discern combinations of semantics that are useful for labelling. This figure is adapted with permission 1553 under a Creative Commons Attribution 4.0 73 license. We find that some electron microscopists are apprehensive about working with ANNs due to a lack of interpretability, irrespective of rigorous ANN validation. We try to address uncertainty by providing loss visualizations in some of our electron microscopy papers 70, 201, 202 .
However, there are a variety of popular approaches to explainable artificial intelligence [1554] [1555] [1556] [1557] [1558] [1559] [1560] (XAI). One of the most popular approaches to XAI is saliency [1561] [1562] [1563] [1564] , where gradients of outputs w.r.t. inputs are taken to correlate with input importance. Saliency is often computed by gradient backpropagation [1565] [1566] [1567] , for example with Grad-CAM 1568 or its variants [1569] [1570] [1571] [1572] . Alternatively, saliency can be predicted by ANNs 1054, 1573, 1574 or a variety of methods inspired by Grad-CAM [1575] [1576] [1577] . Applications of saliency include selecting useful features from a model 1578 , and locating regions in inputs corresponding to ANN outputs 1579 . There are a variety of other approaches to XAI. For example, feature visualization via optimization 1553, [1580] [1581] [1582] [1583] can find inputs that maximally activate parts of an ANN, as shown in figure 16 . Another approach is to cluster features, e.g. by tSNE 1584, 1585 with the Barnes-Hut algorithm 1586, 1587 , and examine corresponding clustering of inputs or outputs 231 . Finally, developers can view raw features and gradients during the forward and backward passes of gradient descent, respectively. For example, CNN explainer 1588, 1589 is an interactive visualization tool designed for non-experts to learn and experiment with CNNs. Similarly, GAN Lab 1590 is an interactive visualization tool for non-experts to learn and experiment with GANs. We introduced a variety of electron microscopy applications in section 1 that have been enabled or enhanced by deep learning. Nevertheless, the greatest benefit of deep learning in electron microscopy may be general-purpose tools that enable researchers to be more effective. Search engines based on deep learning are almost essential to navigate an ever-increasing number of scientific publications 700 . Further, machine learning can enhance communication by filtering spam and phishing attacks [1591] [1592] [1593] , and by summarizing [1594] [1595] [1596] and classifying 1055, 1597-1599 scientific documents. In addition, machine learning can be applied to education to automate and standardize scoring [1600] [1601] [1602] [1603] , detect plagiarism [1604] [1605] [1606] , and identify at-risk students 1607 . Creative applications of deep learning 1608, 1609 include making new art by style transfer 1001-1005 , composing music [1610] [1611] [1612] , and storytelling 1613, 1614 . Similar DNNs can assist programmers 1615, 1616 , for example by predictive source code completion [1617] [1618] [1619] [1620] [1621] [1622] , and by generating source code to map inputs to target outputs 1623 or from labels describing desired source code 1624 . Text-generating DNNs can also help write scientific papers, for example by drafting scientific passages 1625 or drafting part of a paper from a list of references 1626 . Papers generated by early prototypes for automatic scientific paper generators, such as SciGen 1627 , are realistic insofar as they have been accepted by scientific venues. An emerging application of deep learning is mining scientific resources to make new scientific discoveries 1628 . Artificial agents are able to effectively distil latent scientific knowledge as they can parallelize examination of huge amounts of data, whereas information access by humans [1629] [1630] [1631] is limited by human cognition 1632 .
High bandwidth bi-directional brain-machine interfaces are being developed to overcome limitations of human cognition 1633 ; however, they are in the early stages of development and we expect that they will depend on substantial advances in machine learning to enhance control of cognition. Eventually, we expect that ANNs will be used as scientific oracles, where researchers who do not rely on their services will no longer be able to compete. For example, an ANN trained on a large corpus of scientific literature predicted multiple advances in materials science before they were reported 1634 . ANNs are already used for financial asset management 1635, 1636 and recruiting [1637] [1638] [1639] [1640] , so we anticipate that artificial scientific oracle consultation will become an important part of scientific grant 1641, 1642 reviews. A limitation of deep learning is that it can introduce new issues. For example, DNNs are often susceptible to adversarial attacks [1643] [1644] [1645] [1646] [1647] , where small perturbations to inputs cause large errors. Nevertheless, training can be modified to improve robustness to adversarial attacks [1648] [1649] [1650] [1651] [1652] . Another potential issue is architecture-specific systematic errors. For example, CNNs often exhibit structured systematic error variation 70, 201, 202, 1092, 1093, 1653 , including higher errors nearer output edges 70, 201, 202 . However, structured systematic error variation can be minimized by GANs incentivizing the generation of realistic outputs 201 . Finally, ANNs can be difficult to use as they often require downloading code with undocumented dependencies, downloading a pretrained model, and may require hardware accelerators. These issues can be avoided by serving ANNs from cloud containers. However, it may not be practical for academics to acquire funding to cover cloud service costs. Perhaps the most important aspect of deep learning in electron microscopy is that it presents new challenges that can lead to advances in machine learning. Simple benchmarks like CIFAR-10 562, 563 and MNIST 564 have been solved. Subsequently, more difficult benchmarks like Fashion-MNIST 1654 have been introduced. However, they only partially address issues with solved datasets as they do not present fundamentally new challenges. In contrast, we believe that new problems often invite new solutions. For example, we developed adaptive learning rate clipping 261 (ALRC) to stabilize training of DNNs for partial scanning transmission electron microscopy 201 . The challenge was that we wanted to train a large model for high-resolution images; however, training was unstable if we used the small batches needed to fit it in GPU memory. Similar challenges abound and can lead to advances in both machine learning and electron microscopy. No new data were created or analysed in this study. This introductory chapter covers my review paper 96 . There are many review papers on deep learning. Some reviews of deep learning focus on computer science 97-101 , whereas others focus on specific applications such as computational imaging 102 , materials science [103] [104] [105] , and the physical sciences 106 . As a result, I anticipated that another author might review deep learning in electron microscopy.
To avoid my review being easily surpassed, I leveraged my experience to offer practical perspectives and comparative discussions to address common causes of confusion. In addition, content is justified by extensive references to make it easy to use as a starting point for future research. Finally, I was concerned that information about how to get started with deep learning in electron microscopy was fragmented and unclear to unfamiliar developers. This was often problematic when I was asked about getting started with machine learning, and I was especially conscious of it as my friend, Rajesh Patel, asked me for advice when I started writing my review. Consequently, I included a section that introduces useful resources for deep learning in electron microscopy. We have set up new repositories [1] to make our large new electron microscopy datasets available to both electron microscopists and the wider community. There are three main datasets containing 19769 experimental scanning transmission electron microscopy [2] (STEM) images, 17266 experimental transmission electron microscopy [2] (TEM) images and 98340 simulated TEM exit wavefunctions [3] . Experimental datasets represent general research and were collected by dozens of University of Warwick scientists working on hundreds of projects between January 2010 and June 2018. We have been using our datasets to train artificial neural networks (ANNs) for electron microscopy [3] [4] [5] [6] [7] , where standardizing results with common test sets has been essential for comparison. This paper provides details of and visualizations for datasets and their variants, and is supplemented by source code, pretrained models, and both static and interactive visualizations [8] . Machine learning is increasingly being applied to materials science [9, 10] , including to electron microscopy [11] . Encouragingly for scientists, ANNs are universal approximators [12] that can leverage an understanding of physics to represent [13] the best way to perform a task with arbitrary accuracy. In theory, this means that ANNs can always match or surpass the performance of contemporary methods. However, training, validating and testing requires large, carefully partitioned datasets [14, 15] to ensure that ANNs are robust to general use. To this end, our datasets are partitioned so that each subset has different characteristics. For example, TEM or STEM images can be partitioned so that subsets are collected by different scientists, and simulated exit wavefunction partitions can correspond to Crystallographic Information Files [16] (CIFs) for materials published in different journals. Most areas of science are facing a reproducibility crisis [17] , including artificial intelligence [18] . Adding to this crisis, natural scientists do not always benchmark ANNs against standardized public domain test sets, making results difficult or impossible to compare. In electron microscopy, we believe this is a symptom of most datasets being small, esoteric or not having default partitions for machine learning. For example, most datasets in the Electron Microscopy Public Image Archive [19, 20] are for specific materials and are not partitioned. In contrast, standard machine learning datasets such as CIFAR-10 [21, 22] , MNIST [23] , and ImageNet [24] have default partitions for machine learning and contain tens of thousands or millions of examples.
By publishing our large, carefully partitioned machine learning datasets, and setting an example by using them to standardize our research, we aim to encourage higher standardization of machine learning research in the electron microscopy community. There are many popular algorithms for high-dimensional data visualization [25] [26] [27] [28] [29] [30] [31] [32] that can map N high-dimensional vectors of features {x_1, ..., x_N}, x_i ∈ R^u, to low-dimensional vectors {y_1, ..., y_N}, y_i ∈ R^v. A standard approach for data clustering in v ∈ {1, 2, 3} dimensions is t-distributed stochastic neighbor embedding [33, 34] (tSNE). To embed data by tSNE, the Kullback-Leibler (KL) divergence,

KL(P||Q) = Σ_{i≠j} p_ij log(p_ij / q_ij),   (1)

is minimized by gradient descent [35] for normally distributed pairwise similarities in real space, p_ij, and heavy-tailed Student t-distributed pairwise similarities in an embedding space, q_ij. Conditional similarities in real space are

p_{i|j} = exp(−||x_i − x_j||² / (2α_j²)) / Σ_{k≠j} exp(−||x_k − x_j||² / (2α_j²)),   (2)

and, for symmetric tSNE [33],

p_ij = (p_{i|j} + p_{j|i}) / (2N),   (3)
q_ij = (1 + ||y_i − y_j||²)^{−1} / Σ_{k≠l} (1 + ||y_k − y_l||²)^{−1}.   (4)

To control how much tSNE clusters data, perplexities of p_{i|j} for j ∈ {1, ..., N} are adjusted to a user-provided value by fitting α_j. Perplexity, exp(H), is an exponential function of entropy, H, and most tSNE visualizations are robust to moderate changes to its value. Feature extraction is often applied to decrease input dimensionality, typically to u ≲ 100, before clustering data by tSNE. Decreasing input dimensionality can decrease data noise and computation for large datasets, and is necessary for some high-dimensional data as the distances, ||x_i − x_j||_2, used to compute p_ij are affected by the curse of dimensionality [36]. For image data, a standard approach [33] to extract features is probabilistic [37, 38] or singular value decomposition [39] (SVD) based principal component analysis [40] (PCA). However, PCA is limited to linearly separable features. Other hand-crafted feature extraction methods include using a histogram of oriented gradients [41], speeded-up robust features [42], local binary patterns [43], wavelet decomposition [44] and other methods [45]. The best features to extract for a visualization depend on its purpose. However, most hand-crafted feature extraction methods must be tuned for different datasets. For example, Minka's algorithm [46] is included in the scikit-learn [47] implementation of PCA by SVD to obtain optimal numbers of principal components to use. To increase representation power, nonlinear and dataset-specific features can be extracted with deep learning. For example, by using the latent space of an autoencoder [48, 49] (AE) or features before logits in a classification ANN [50]. Indeed, we have posted AEs for electron microscopy with pre-trained models [51, 52] that could be improved. However, AE latent vectors can exhibit inhomogeneous dimensional characteristics and pathological semantics, limiting correlation between latent features and semantics. To encode well-behaved latent vectors suitable for clustering by tSNE, variational autoencoders [53, 54] (VAEs) can be trained to encode data as multivariate probability distributions. For example, VAEs are often regularized to encode multivariate normal distributions by adding the KL divergence of encodings from a standard normal distribution to the loss function [53]. The regularization homogenizes dimensional characteristics, and sampling noise correlates semantics with latent features. To visualize datasets presented in this paper, we trained VAEs shown in figure 1 to embed 96×96 images in u = 64 dimensions before clustering in v = 2 dimensions by tSNE.
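For reference, the hand-crafted baseline of PCA followed by tSNE (used for comparison in figure 2(d)) can be sketched with scikit-learn as below; the random data stand in for flattened images, and the perplexity of 30 is a typical default rather than the N^{1/2} heuristic adopted later in this paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical stand-in for flattened 96x96 images; replace with real data.
x = np.random.rand(2000, 96 * 96)

# Reduce dimensionality to u <= 100 features before tSNE to limit noise,
# computation, and the curse of dimensionality.
features = PCA(n_components=50).fit_transform(x)

# Embed in two dimensions; perplexity controls how much tSNE clusters data.
embedding = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
print(embedding.shape)  # (2000, 2)
```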
Our VAE consists of two convolutional neural networks [55, 56]: an encoder and a generator. Input images are linearly transformed to have minimum and maximum values of 0 and 1, respectively, and we apply a random combination of flips and 90° rotations to augment training data by a factor of eight. The generator, G, is trained to cooperate with the encoder to output encoder inputs by sampling latent vectors, z_i = µ_i + σ_i ⊙ ϵ_i, where µ_i = {µ_i1, ..., µ_iu} and σ_i = {σ_i1, ..., σ_iu} are means and standard deviations parameterized by the encoder, and ϵ_i = {ϵ_i1, ..., ϵ_iu} are random variates sampled from standard normal distributions, ϵ_ij ~ N(0, 1). Each convolutional or fully connected layer is followed by batch normalization [57] then ReLU [58] activation, except the output layers of the encoder and generator. An absolute nonlinearity, f(x) = |x|, is applied to encode positive standard deviations. Traditional VAEs are trained to optimize a balance, λ_MSE, between mean squared errors (MSEs) of generated images and the KL divergence of encodings from a multivariate standard normal distribution [53],

L = λ_MSE MSE(G(z), I) + D_KL(N(µ, σ) || N(0, 1)).   (5)

However, traditional VAE training is sensitive to λ_MSE [59] and other hyperparameters [60]. If λ_MSE is too low, the encoder will learn to consistently output σ_ij ≃ 1 and µ_ij ≃ 0, limiting the information encoded. Else if λ_MSE is too high, the encoder will learn to output σ_ij ≪ |µ_ij|, limiting regularization. As a result, traditional VAE hyperparameters must be carefully tuned for different ANN architectures and datasets. To improve VAE regularization and robustness to different datasets, we normalize encodings parameterizing normal distributions, subtracting batch means from µ and dividing µ and σ by batch standard deviations computed across each batch. Encoding normalization is a modified form of batch normalization [57] for VAE latent spaces. As part of encoding normalization, we introduce a new hyperparameter, λ_µ, to scale the ratio of expectations E(|µ_ij|)/E(|σ_ij|). We use λ_µ = 2.5 in this paper; however, we confirm that training is robust to values λ_µ ∈ {1.0, 2.0, 2.5} for a range of datasets and ANN architectures. Batch means are subtracted from µ and not σ so that σ_ij ≥ 0. In addition, we multiply σ_std,j by an arbitrary factor of 2 so that E(|µ_ij|) ≈ E(|σ_ij|) for λ_µ = 1. Encoding normalization enables the KL divergence loss in equation 5 to be removed as latent space regularization is built into the encoder architecture. However, we find that removing the KL loss can result in VAEs encoding either very low or very high σ_ij. In effect, an encoder can learn to use σ to apply a binary mask to µ if a generator learns that latent features with very high absolute values are not meaningful. To prevent extreme σ_ij, we add a new encoding regularization loss, MSE(σ, 1), to the encoder. Human vision is sensitive to edges [61], so we also add a gradient-based loss to improve realism. Adding a gradient-based loss is a computationally inexpensive alternative to training a variational autoencoder generative adversarial network [62] (VAE-GAN) and often achieves similar performance. Our total training loss is

L = λ_MSE MSE(G(z), I) + λ_Sobel MSE(S(G(z)), S(I)) + MSE(σ, 1),   (11)

where S applies Sobel filters to compute image gradients. We trained our VAEs with an exponentially decayed learning rate, η, and a DEMON [67] first moment of the momentum decay rate, β, where we chose initial values η_start = 0.001 and β_start = 0.9, exponential base a = 0.5, b = 8 steps, and T = 600000 iterations. We used a batch size of B = 64 and emphasize that a large batch size decreases complication of encoding normalization by varying batch statistics. Training our VAEs takes about 12 hours on a desktop computer with an Nvidia GTX 1080 Ti GPU and an Intel i7-6700 CPU.
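The sketch below outlines encoding normalization in NumPy under the assumption that the batch statistics are means and standard deviations of µ, and a root-mean-square scale of σ, computed across the batch; it follows the description above rather than our released implementation, and the function and variable names are illustrative.

```python
import numpy as np

def normalize_encodings(mu, sigma, lambda_mu=2.5, eps=1e-6):
    """Normalize VAE encodings across a batch so that latent-space
    regularization is built into the encoder (a sketch of the method
    described in the text, not the released implementation)."""
    mu_mean = mu.mean(axis=0)                               # batch mean per latent feature
    mu_std = mu.std(axis=0) + eps                           # batch standard deviation of mu
    sigma_scale = np.sqrt((sigma ** 2).mean(axis=0)) + eps  # batch scale of sigma
    mu_norm = lambda_mu * (mu - mu_mean) / mu_std
    sigma_norm = sigma / (2.0 * sigma_scale)                # means not subtracted, so sigma >= 0
    return mu_norm, sigma_norm

# Example with a hypothetical batch of 64 encodings in a 64-dimensional latent space.
rng = np.random.default_rng(0)
mu, sigma = rng.normal(size=(64, 64)), np.abs(rng.normal(size=(64, 64)))
mu_n, sigma_n = normalize_encodings(mu, sigma)
print(np.abs(mu_n).mean(), np.abs(sigma_n).mean())  # relative scales controlled by lambda_mu
```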
To use VAE latent spaces to cluster data, means are often embedded by tSNE. However, this does not account for highly varying σ used to calculate latent features. To account for uncertainty, we modify the calculation of pairwise similarities, p_{i|j}, in equation 2 to include both µ_i and σ_i encoded for every example, i ∈ [1, N], in our datasets, where we chose weights to balance contributions from µ and σ. We add ε = 0.01 for numerical stability, and to account for uncertainty in σ due to encoder imperfections or variation in batch statistics. Following Oskolkov [68], we fit α_j to perplexities given by N^{1/2}, where N is the number of examples in a dataset, and confirm that changing perplexities by ±100 has little effect on visualizations for our N ≃ 20000 TEM and STEM datasets. To ensure convergence, we run tSNE computations for 10000 iterations. In comparison, KL divergence is stable by 5000 iterations for our datasets. In preliminary experiments, we observe that tSNE with σ results in comparable visualizations to tSNE without σ, and we think that tSNE with σ may be a slight improvement. For comparison, pairs of visualizations with and without σ are indicated in supplementary information. Our improvements to dataset visualization by tSNE are showcased in figure 2 for various embedding methods. The visualizations are for a new dataset containing 19769 96×96 crops from STEM images, which will be introduced in section 3. To suppress high-frequency noise during training, images were blurred by a 5×5 symmetric Gaussian kernel with a 2.5 px standard deviation. Clusters are most distinct in figure 2(a) for encoding normalized VAE training with a gradient loss described by equation 11. Ablating the gradient loss in figure 2(b) results in similar clustering; however, the VAE struggles to separate images of noise and fine atom columns. In contrast, clusters are not clearly separated in figure 2(c) for a traditional VAE described by equation 5. Finally, embedding the first 50 principal components extracted by a scikit-learn [69] implementation of probabilistic PCA in figure 2(d) does not result in clear clustering. We curated 19769 STEM images from University of Warwick electron microscopy dataservers to train ANNs for compressed sensing [5, 7]. Atom columns are visible in roughly two-thirds of images, and similar proportions are bright and dark field. In addition, most signals are noisy [76] and are imaged at several times their Nyquist rates [77]. To reduce data transfer times for large images, we also created a variant containing 161069 non-overlapping 512×512 crops from full images. For rapid development, we have also created new variants containing 96×96 images downsampled or cropped from full images. In this section we give details of each STEM dataset, referring to them using their names in our repositories. STEM Full Images: the dataset is partitioned into training, validation, and test sets, with 2966 test set images. The dataset was made by concatenating contributions from different scientists, so partitioning the dataset before shuffling also partitions scientists. An interactive visualization that displays images when map points are hovered over is also available [8]. This paper is aimed at a general audience so readers may not be familiar with STEM. Subsequently, example images are tabulated with references and descriptions in table 1 to make them more tangible. We curated 17266 2048×2048 high-signal TEM images from University of Warwick electron microscopy dataservers to train ANNs to improve signal-to-noise [4]. However, our dataset was only available upon request. It is now openly available [1]. For convenience, we have also created a new variant containing 96×96 images that can be used for rapid ANN development.
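For instance, such 96×96 variants can be produced by downsampling with anti-aliasing; the sketch below is an illustrative approximation using scikit-image, not the MATLAB area resizing actually used to prepare the datasets, and the random image is a hypothetical stand-in.

```python
import numpy as np
from skimage.transform import resize

def downsample_micrograph(image, size=(96, 96)):
    """Downsample with anti-aliasing, then linearly transform intensities
    to minimum and maximum values of 0 and 1."""
    small = resize(image, size, anti_aliasing=True)
    lo, hi = small.min(), small.max()
    return (small - lo) / (hi - lo + 1e-12)

image = np.random.rand(2048, 2048)          # hypothetical full-sized TEM image
print(downsample_micrograph(image).shape)   # (96, 96)
```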
In this section we give details of each TEM dataset, referring to them using their names in our repositories. TEM Full Images: 17266 32-bit TIFFs containing 2048×2048 TEM images taken with University of Warwick JEOL 2000, JEOL 2100, JEOL 2100+, and JEOL ARM 200F electron microscopes by dozens of scientists working on hundreds of projects. Images were originally saved in DigitalMicrograph DM3 or DM4 files created by Gatan Microscopy Suite [78] software and have been cropped to the largest possible squares and area resized to 2048×2048 with MATLAB and default antialiasing. Images with at least 2500 electron counts per pixel were then linearly transformed to have minimum and maximum values of 0 and 1, respectively. We discarded images with fewer than 2500 electron counts per pixel as images were curated to train an electron micrograph denoiser [4]. The dataset is partitioned into 11350 training, 2431 validation, and 3486 test set images. The dataset was made by concatenating contributions from different scientists, so each partition contains data collected by a different subset of scientists. An interactive visualization that displays images when map points are hovered over is also available [8]. This paper is aimed at a general audience so readers may not be familiar with TEM. Subsequently, example images are tabulated with references and descriptions in table 2 to make them more tangible. We simulated 98340 TEM exit wavefunctions to train ANNs to reconstruct phases from amplitudes [3]. Half of wavefunction information is undetected by conventional TEM as only the amplitude, and not the phase, of an image is recorded. Wavefunctions were simulated at 512×512 then centre-cropped to 320×320 to remove simulation edge artefacts. Wavefunctions have been simulated for real physics, where Kirkland potentials [87] for each atom are summed from n = 3 terms, and by truncating Kirkland potential summations to n = 1 to simulate an alternative universe where atoms have different potentials. Wavefunctions simulated for an alternate universe can be used to test ANN robustness to simulation physics. Table 2. Examples and descriptions of TEM images in our datasets. References put some images into context to make them more tangible to unfamiliar readers. For rapid development, we also downsampled n = 3 wavefunctions from 320×320 to 96×96. In this section we give details of each exit wavefunction dataset, referring to them using their names in our repositories. Experimental Focal Series: 1000 experimental focal series. Each series consists of 14 32-bit 512×512 TEM images, area downsampled from 4096×4096 with MATLAB and default antialiasing. The images are in TIFF [98] format. All series were created with a common, quadratically increasing [99] defocus series. However, spatial scales vary and would need to be fitted as part of wavefunction reconstruction. In detail, exit wavefunctions for a large range of physical hyperparameters were simulated with clTEM [100, 101] for acceleration voltages in {80, 200, 300} kV, material depths uniformly distributed in [5, 100) nm, material widths in [5, 10) nm, and crystallographic zone axes (h, k, l) with h, k, l ∈ {0, 1, 2}. Materials were padded on all sides with vacuum 0.8 nm wide and 0.3 nm deep to reduce simulation artefacts. Finally, crystal tilts were perturbed by small zero-centred Gaussian random variates. The best dataset variant varies for different applications. Full-sized datasets can always be used as other dataset variants are derived from them.
However, loading and processing full-sized examples may bottleneck training, and it is often unnecessary. Instead, smaller 512×512 crops, which can be loaded more quickly than full-sized images, can often be used to train ANNs to be applied convolutionally [102] to or tiled across [4] full-sized inputs. In addition, our 96×96 datasets can be used for rapid initial development before scaling up to full-sized datasets, similar to how ANNs might be trained with CIFAR-10 before scaling up to ImageNet. However, subtle application- and dataset-specific considerations may also influence the best dataset choice. For example, an ANN trained with downsampled 96×96 inputs may not generalize to 96×96 crops from full-sized inputs as downsampling may introduce artefacts [103] and change noise or other data characteristics. In practice, electron microscopists image most STEM and TEM signals at several times their Nyquist rates [77]. This eases visual inspection, decreases sub-Nyquist aliasing [104], improves display on computer monitors, and is easier than carefully tuning sampling rates to capture the minimum data needed to resolve signals. High sampling may also reveal additional high-frequency information when images are inspected after an experiment. However, this complicates ANN development as it means that information per pixel is often higher in downsampled images. For example, partial scans across STEM images that have been downsampled to 96×96 require higher coverages than scans across 96×96 crops for ANNs to learn to complete images with equal performance [5]. It also complicates the comparison of different approaches to compressed sensing. For example, we suggested that sampling 512×512 crops at a regular grid of probing locations outperforms sampling along spiral paths as a subsampling grid can still access most information [5]. Test set performance should be calculated for a standardized dataset partition to ease comparison with other methods. Nevertheless, training and validation partitions can be varied to investigate validation variance for partitions with different characteristics. Default training and validation sets for STEM and TEM datasets contain contributions from different scientists that have been concatenated or numbered in order, so new validation partitions can be selected by concatenating training and validation partitions and moving the window used to select the validation set. Similarly, exit wavefunctions were simulated with CIFs from different journals that were concatenated or numbered sequentially. There is leakage [105, 106] between training, validation and test sets due to overlap between materials published in different journals and between different scientists' work. However, further leakage can be minimized by selecting dataset partitions before any shuffling and, for wavefunctions, by ensuring that simulations for each journal are not split between partitions. Experimental STEM and TEM image quality is variable. Images were taken by scientists with all levels of experience and TEM images were taken on multiple microscopes. This means that our datasets contain images that might be omitted from other datasets. For example, the tSNE visualization for STEM in figure 3 includes incomplete scans, ~50 blank images, and images that only contain noise.
Similarly, the tSNE visualization for TEM in figure 4 revealed some images where apertures block electrons, and that there are a small number of unprocessed standard diffraction and convergent beam electron diffraction [107] patterns. Although these conventionally low-quality images would not normally be published, they are important to ensure that ANNs are robust for live applications. In addition, inclusion of conventionally low-quality images may enable identification of this type of data. We encourage readers to try our interactive tSNE visualizations [8] for detailed inspection of our datasets. In this paper, we present tSNE visualizations of VAE latent spaces to show image variety. However, our VAEs can be directly applied to a wide range of additional applications. For example, successful tSNE clustering of latent spaces suggests that VAEs could be used to create a hash table [108, 109] for an electron micrograph search engine. VAEs can also be applied to semantic manipulation [110], and clustering in tSNE visualizations may enable subsets of latent space that generate interesting subsets of data distributions to be identified. Other applications include using clusters in tSNE visualizations to label data for supervised learning, data compression, and anomaly detection [111, 112]. To encourage further development, we have made our source code and pretrained VAEs openly available [8]. We have presented details of and visualizations for large new electron microscopy datasets that are openly available from our new repositories. Datasets have been carefully partitioned into training, validation, and test sets for machine learning. Further, we provide variants containing 512×512 crops to reduce data loading times, and examples downsampled to 96×96 for rapid development. To improve dataset visualization with VAEs, we introduce encoding normalization and regularization, and add an image gradient loss. In addition, we propose extending tSNE to account for encoded standard deviations. Source code, pretrained VAEs, precompiled tSNE binaries, and interactive dataset visualizations are provided in supplementary repositories to help users become familiar with our datasets and visualizations. By making our datasets available, we aim to encourage standardization of performance benchmarks in electron microscopy and increase participation of the wider computer science community in electron microscopy research. Ten additional tSNE visualizations are provided as supplementary information. Interactive versions of tSNE visualizations that display data when map points are hovered over are available [8] for every figure. In addition, we propose an algorithm to increase whitespace utilization in tSNE visualizations by uniformly separating points, and show that our VAEs can be used as the basis of image search engines. Supplementary information is openly available at https://doi.org/10.5281/zenodo.3899740 and stacks.iop.org/MLST/1/045003/mmedia. The data that support the findings of this study are openly available at https://doi.org/10.5281/zenodo.3834197. For additional information contact the corresponding author (J.M.E.). Figure numbers for a variety of two-dimensional tSNE visualizations are tabulated in table S1 to ease comparison.
Visualizations are for the first 50 principal components extracted by a scikit-learn 1 implementation of probabilistic PCA, means encoded in 64-dimensional VAE latent spaces without modified tSNE losses to account for standard deviations, and means encoded in 64-dimensional VAE latent spaces with modified tSNE losses to account for standard deviations. Interactive versions of tSNE visualizations that display data when map points are hovered over are also available for every visualization 2 . In addition, our source code, graph points and datasets are openly available. Table S1. To ease comparison, we have tabulated figure numbers for tSNE visualizations. Visualizations are for principal components, VAE latent space means, and VAE latent space means weighted by standard deviations. Visualization of complex exit wavefunctions is complicated by the display of their real and imaginary components. However, real and imaginary components are related 3 , and can be visualized in the same image by displaying them in different colour channels. For example, we show real and imaginary components in red and blue colour channels, respectively, in figures S4-S6. Note that a couple of extreme points are cropped from some of the tSNE visualizations of principal components in figures S1-S6. However, this only affected ∼0.01% of points and therefore does not have a substantial effect on visualizations. In contrast, tSNE visualizations of VAE latent spaces did not have extreme points. A limitation is that most tSNE visualizations do not fully utilize whitespace. This is problematic as space is often limited in journals, websites and other media. As a result, we propose algorithm 1 to uniformly separate map points. This increases whitespace utilization while keeping clustered points together. Example applications are shown in figures S11-S13, where images nearest points on a regular grid are shown at grid points. Uniformly separating map points removes information about pairwise distances encoded in the tSNE distributions. However, distances and cluster sizes in tSNE visualizations are not overly meaningful 4 . Overall, we think that uniformly separated tSNE is an interesting option that could be improved by further development. To this end, our source code and graph points for uniformly separated tSNE visualizations are openly available 2 . Our VAEs can be used as the basis of image search engines. To find similar images, we compute Euclidean distances between means encoded for search inputs and images in the STEM or TEM datasets, then select images at the lowest distances, as sketched in the example below. Examples of top-5 search results for various input images are shown in figure S14 and figure S15 for TEM and STEM, respectively. Search results are most accurate for common images and are less accurate for unusual images. The main difference between the performance of our search engines and that of Google, Bing or other commercial image search engines results from commercial ANNs being trained with over 100× more training data, 3500× more computational resources and larger images, c.f. Xception 5 . However, the performance of our search engines is reasonable and our VAEs could easily be scaled up. Our source code, pretrained models and VAE encodings for each dataset are openly available 2 .
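A minimal sketch of such a search is shown below; the random encodings are hypothetical stand-ins for VAE means of dataset and query images, and the function name is illustrative.

```python
import numpy as np

def top_k_search(query_mu, dataset_mu, k=5):
    """Return indices and distances of the k dataset images whose encoded
    means are at the smallest Euclidean distances from the query's mean."""
    distances = np.linalg.norm(dataset_mu - query_mu, axis=1)
    order = np.argsort(distances)[:k]
    return order, distances[order]

# Hypothetical encodings standing in for VAE means of dataset and query images.
rng = np.random.default_rng(0)
dataset_mu = rng.normal(size=(19769, 64))   # e.g. one 64-D mean per STEM image
query_mu = rng.normal(size=64)
indices, distances = top_k_search(query_mu, dataset_mu, k=5)
print(indices, distances)
```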
Algorithm 1: Two-dimensional Bayesian inverse transformed tSNE. We default to a h = w = 25 grid.
1. Linearly transform map point dimensions to have values in [0, 1].
2. Divide points into an evenly spaced grid with h×w cells.
3. Compute the number of points in each cell, n_ab, a ∈ [1, h], b ∈ [1, w].
4. Compute cumulative numbers of points using recurrent relations, where c_0 = c_{0|a} = 0.
5. Estimate Bayesian conditional cumulative distribution functions from the cumulative counts.
6. Interpolate uniformly separated grid positions, Y, for X based on pairs of grid and distribution points. We use Clough-Tocher cubic Bezier interpolation 6 .
Figures S8-S15 (two-dimensional tSNE visualizations of VAE latent spaces for STEM and TEM datasets, uniformly separated tSNE visualizations, and examples of top-5 search results for 96×96 TEM and STEM images) are shown in the supplementary information; their full captions are given in the list of figures.
There are amendments or corrections to the paper 2 covered by this chapter. Location: Page 4, caption of fig. 2 . Change: "...at 500 randomly selected images..." should say "...at 500 randomly selected data points...". This ancillary chapter covers my paper titled "Warwick Electron Microscopy Datasets" 2 . Further, a distribution of generated images could be sensitive to ANN architecture and learning policy, including when training is stopped 177, 178 . Nevertheless, I expect that data generated by VAEs could be used for pretraining to improve ANN robustness 179 . Overall, I think it will become increasingly practical to use VAEs or GANs as high-quality data generators as ANN architectures and learning policies are improved. Perhaps the main limitation of my paper is that I did not introduce my preferred abbreviation, "WEMD", for "Warwick Electron Microscopy Datasets". Further, I did not define "WEMD" in my WEMD preprint 13 . Subsequently, I introduced my preferred abbreviation in my review of deep learning in electron microscopy 1 (ch. 1). I also defined an abbreviation, "WLEMD", for "Warwick Large Electron Microscopy Datasets" in the first version of the partial STEM preprint 18 . Loss spikes arise when artificial neural networks (ANNs) encounter difficult examples and can destabilize training with gradient descent [1, 2] .
Examples may be difficult because an ANN needs more training to generalize, has catastrophically forgotten previous learning [3] , or because an example is complex or unusual. Whatever the cause, applying gradients backpropagated [4] from high losses results in large perturbations to trainable parameters. When a trainable parameter perturbation is much larger than others, learning can be destabilized while parameters adapt. This behaviour is common for ANN training with gradient descent where a large portion of parameters is perturbed at each optimization step. In contrast, biological networks often perturb small portions of neurons to combine new learning with previous learning. Similar to biological networks, ANN layers can become more specialized throughout training [5] and specialized capsule networks [6] are being developed. Nevertheless, ANN loss spikes during optimization are still a common reason for learning instability. Loss spikes are common for training with small batch sizes, high order loss functions, and unstably high learning rates. During ANN training by stochastic gradient descent [1] (SGD), a trainable parameter, θ_t, from step t is updated to θ_{t+1} in step t + 1. The size of the update is given by the product of a learning rate, η, and the backpropagated gradient of a loss function, L, with respect to the trainable parameter, θ_{t+1} = θ_t − η ∂L/∂θ_t. Without modification, trainable parameter perturbations are proportional to the scale of the loss function. Following gradient backpropagation, a high loss spike can cause a large perturbation to a learned parameter distribution. Learning will then be destabilized while subsequent iterations update trainable parameters back to an intelligent distribution. Trainable parameter perturbations are often limited by clipping gradients to a multiple of their global L2 norm [7] . For large batch sizes, this can limit perturbations by loss spikes as their gradients will be larger than other gradients in the batch. However, global L2 norm clipping alters the distribution of gradients backpropagated from high losses and is unable to identify and clip high losses if the batch size is small. Clipping gradients of individual layers by their L2 norms has the same limitations. Gradient clipping to a user-provided threshold [8] can also be applied globally or to individual layers. This can limit loss spike perturbations for any batch size. However, the clipping threshold is an extra hyperparameter to determine and may need to be changed throughout training. Further, it does not preserve distributions of gradients for high losses. More commonly, destabilizing perturbations are reduced by selecting a low order loss function and stable learning rate. Low order loss functions, such as absolute and squared distances, are effective because they are less susceptible to destabilizingly high errors than higher-order loss functions. Indeed, loss function modifications used to stabilize learning often lower loss function order. For instance, Huberization [9, 10] reduces perturbations by losses, L, larger than h by applying the mapping L → min(L, (hL)^{1/2}). Adaptive learning rate clipping (ALRC, algorithm 1) is designed to address the limitations of gradient clipping. Namely, to be computationally inexpensive, effective for any batch size, robust to hyperparameter choices and to preserve backpropagated gradient distributions. Like gradient clipping, ALRC also has to be applicable to arbitrary loss functions and neural network architectures.
Rather than allowing loss spikes to destabilize learning, ALRC applies the mapping ηL → stop_gradient(L_max/L)ηL if L > L_max. The function stop_gradient leaves its operand unchanged in the forward pass and blocks gradients in the backwards pass. ALRC adapts the learning rate to limit the effective loss being backpropagated to L_max. The value of L_max is non-trivial for ALRC to complement existing learning algorithms. In addition to training stability and robustness to hyperparameter choices, L_max needs to adapt to losses and learning rates as they vary. In our implementation, L_max and L_min are numbers of standard deviations of the loss above and below its mean, respectively. ALRC has six hyperparameters; however, it is robust to their values. There are two decay rates, β_1 and β_2, for exponential moving averages used to estimate the mean and standard deviation of the loss and a number, n, of standard deviations. Similar to batch normalization [11], any decay rate close to 1 is effective e.g. β_1 = β_2 = 0.999. Performance does vary slightly with n_max; however, we found that any n_max ≈ 3 is effective. Varying n_min is an optional extension and we default to one-sided ALRC above i.e. n_min = ∞. Initial values for the running means, µ_1 and µ_2, where µ_1² < µ_2, also have to be provided. However, any sensible initial estimates larger than their true values are fine as µ_1 and µ_2 will decay to their correct values. ALRC can be extended to any loss function or batch size. For batch sizes above 1, we apply ALRC to individual losses, while µ_1 and µ_2 are updated with mean losses. ALRC can also be applied to loss summands, such as per pixel errors between generated and reference images, while µ_1 and µ_2 are updated with the mean errors.
Algorithm 1: Two-sided adaptive learning rate clipping (ALRC) of loss spikes. Sensible parameters are β_1 = β_2 = 0.999, n_min = ∞, n_max = 3, and µ_1² < µ_2.
Initialize running means, µ_1 and µ_2, with decay rates, β_1 and β_2. Choose numbers, n_min and n_max, of standard deviations to clip to.
while training is not finished do
Infer forward-propagation loss, L.
σ_L ← (µ_2 − µ_1²)^{1/2}; L_min ← µ_1 − n_min σ_L; L_max ← µ_1 + n_max σ_L
if L < L_min then L_dyn ← stop_gradient(L_min/L)L
else if L > L_max then L_dyn ← stop_gradient(L_max/L)L
else L_dyn ← L
end if
Optimize network by back-propagating L_dyn.
µ_1 ← β_1 µ_1 + (1 − β_1)L
µ_2 ← β_2 µ_2 + (1 − β_2)L²
end while
Figure 1. Unclipped learning curves for 2× CIFAR-10 supersampling with batch sizes 1, 4, 16 and 64 with and without adaptive learning rate clipping of losses to 3 standard deviations above their running means. Training is more stable for squared errors than quartic errors. Learning curves are 500 iteration boxcar averaged.
To investigate the ability of ALRC to stabilize learning and its robustness to hyperparameter choices, we performed a series of toy experiments with networks trained to supersample CIFAR-10 [12, 13] images to 32×32×3 after downsampling to 16×16×3.
Figure 2. Unclipped learning curves for 2× CIFAR-10 supersampling with ADAM and SGD optimizers at stable and unstably high learning rates, η. Adaptive learning rate clipping prevents loss spikes and decreases errors at unstably high learning rates. Learning curves are 500 iteration boxcar averaged.
Data pipeline: In order, images were randomly flipped left or right, had their brightness altered, had their contrast altered, were linearly transformed to have zero mean and unit variance and bilinearly downsampled to 16×16×3. Architecture: Images were upsampled and passed through a convolutional neural network [14, 15] shown in figure 5 . Each convolutional layer is followed by ReLU [16] activation, except the last.
Initialization: All weights were Xavier [17] initialized. Biases were zero initialized. Learning policy: ADAM optimization was used with the hyperparameters recommended in [18] and a base learning rate of 1/1280 for 100 000 iterations. The learning rate was constant in batch size 1, 4, 16 experiments and decreased to 1/12800 after 54 687 iterations in batch size 64 experiments. Networks were trained to minimize mean squared or quartic errors between restored and ground truth images. ALRC was applied to limit the magnitudes of losses to either 2, 3, 4 or ∞ standard deviations above their running means. For batch sizes above 1, ALRC was applied to each loss individually. Results: Example learning curves for mean squared and quartic error training are shown in figure 1 . Training is more stable and converges to lower losses for larger batch sizes. However, learning is less stable for quartic errors than squared errors, allowing ALRC to be examined for loss functions with different stability. Training was repeated 10 times for each combination of ALRC threshold and batch size. Means and standard deviations of the means of the last 5000 training losses for each experiment are tabulated in table 1. ALRC has no effect on mean squared error (MSE) training, even for batch size 1. However, it decreases errors for batch sizes 1, 4 and 16 for mean quartic error training. Additional learning curves are shown in figure 2 for both ADAM and SGD optimizers to showcase the effect of ALRC on unstably high learning rates. Experiments are for a batch size of 1. ALRC has no effect at stable learning rates where learning is unaffected by loss spikes. However, ALRC prevents loss spikes and decreases errors at unstably high learning rates. In addition, these experiments show that ALRC is effective for different optimizers. To test ALRC in practice, we applied our algorithm to neural networks learning to complete 512×512 scanning transmission electron microscopy (STEM) images [19] from partial scans [20] with 1/20 coverage. Example completions are shown in figure 3 . Data pipeline: In order, each image was subject to a random combination of flips and 90° rotations to augment the dataset by a factor of 8. Next, each STEM image was blurred, and a path described by a 1/20 coverage spiral was selected. Finally, artificial noise was added to scans to make them more difficult to complete. Architecture: Our network can be divided into three subnetworks shown in figure 6 : an inner generator, outer generator and an auxiliary inner generator trainer. The auxiliary trainer [21, 22] is introduced to provide a more direct path for gradients to backpropagate to the inner generator. Each convolutional layer is followed by ReLU activation, except the last. Initialization: Weights were initialized from a normal distribution with mean 0.00 and standard deviation 0.05. There are no biases. Weight normalization: All generator weights are weight normalized [23] and a weight normalization initialization pass was performed after weight initialization. Following [23, 24] , running mean-only batch normalization was applied to the output channels of every convolutional layer except the last. Channel means were tracked by exponential moving averages with decay rates of 0.99. Similar to [25] , running mean-only batch normalization was frozen in the second half of training to improve stability.
Loss functions: The auxiliary inner generator trainer learns to generate half-size completions that minimize MSEs from half-size blurred ground truth STEM images. Meanwhile, the outer generator learns to produce full-size completions that minimize MSEs from blurred STEM images. All MSEs were multiplied by 200. The inner generator cooperates with the auxiliary inner generator trainer and outer generator. To benchmark ALRC, we investigated training with MSEs, Huberized (h = 1) MSEs, MSEs with ALRC and Huberized (h = 1) MSEs with ALRC before Huberization. Training with both ALRC and Huberization showcases the ability of ALRC to complement another loss function modification. Learning policy: ADAM optimization [18] was used with a constant generator learning rate of 0.0003 and a first moment of the momentum decay rate, β_1 = 0.9, for 250 000 iterations. In the next 250 000 iterations, the learning rate and β_1 were linearly decayed in eight steps to zero and 0.5, respectively. The learning rate for the auxiliary inner generator trainer was two times the generator learning rate; β_1 was the same. All training was performed with batch size 1 due to the large model size needed to complete 512×512 scans. Results: Outer generator losses in figure 4 show that ALRC and Huberization stabilize learning. Further, ALRC accelerates MSE and Huberized MSE convergence to lower losses. To be clear, learning policy was optimized for MSE training so direct loss comparison is uncharitable to ALRC.
Algorithm 3: Doubly adaptive learning rate clipping (DALRC) of loss spikes.
Initialize running means, µ_1, µ_↓ and µ_↑, with decay rates, β_1, β_↓ and β_↑. Choose numbers, n_↓ and n_↑, of standard deviations to clip to.
while training is not finished do
Infer forward-propagation loss, L.
L_min ← µ_1 − n_↓ µ_↓; L_max ← µ_1 + n_↑ µ_↑
if L < L_min then L_dyn ← stop_gradient(L_min/L)L
else if L > L_max then L_dyn ← stop_gradient(L_max/L)L
else L_dyn ← L
end if
Optimize network by back-propagating L_dyn.
if L > µ_1 then µ_↑ ← β_↑ µ_↑ + (1 − β_↑)(L − µ_1)
else if L < µ_1 then µ_↓ ← β_↓ µ_↓ + (1 − β_↓)(µ_1 − L)
end if
µ_1 ← β_1 µ_1 + (1 − β_1)L
end while
ALRC was developed to limit perturbations by loss spikes. Nevertheless, ALRC can also increase parameter perturbations for low losses, possibly improving performance on examples that an ANN is already good at. To investigate ALRC variants, we trained a generator to supersample STEM images to 512×512 after nearest neighbour downsampling to 103×103. Network architecture and learning protocols are the same as those for partial STEM in section 4, except training iterations are increased from 5×10^5 to 10^6. Means and standard deviations of 20 000 unclipped test set MSEs for possible ALRC variants are tabulated in table 2. Variants include constant learning rate clipping (CLRC) in algorithm 2, where the effective loss is kept between constant values, and doubly adaptive learning rate clipping (DALRC) in algorithm 3, where moments above and below a running mean are tracked separately. ALRC has the lowest test set MSEs whereas DALRC has lower variance. Both ALRC and DALRC outperform no learning rate clipping for all tabulated hyperparameters and may be a promising starting point for future research on learning rate clipping. Taken together, our CIFAR-10 supersampling results show that ALRC improves stability and lowers losses for learning that would be destabilized by loss spikes and otherwise has little effect. Loss spikes are often encountered when training with high learning rates, high order loss functions or small batch sizes.
For example, a moderate learning rate was used in MSE experiments so that losses did not spike enough to destabilize learning. In contrast, training at the same learning rate with quartic errors is unstable so ALRC stabilizes learning and lowers losses. Similar results are confirmed at unstably high learning rates, for partial STEM and for STEM supersampling, where ALRC stabilizes learning and lowers losses. ALRC is designed to complement existing learning algorithms with new functionality. It is effective for any loss function or batch size and can be applied to any neural network trained with gradient descent. Our algorithm is also computationally inexpensive, requiring orders of magnitude fewer operations than other layers typically used in neural networks. As ALRC either stabilizes learning or has little effect, this means that it is suitable for routine application to arbitrary neural network training with gradient descent. In addition, we note that ALRC is a simple algorithm that has a clear effect on learning. Nevertheless, ALRC can replace other learning algorithms in some situations. For instance, ALRC is a computationally inexpensive alternative to gradient clipping in high batch size training where gradient 7 Figure 6 . Two-stage generator that completes 512× 512 micrographs from partial scans. A dashed line indicates that the same image is input to the inner and outer generator. Large scale features developed by the inner generator are locally enhanced by the outer generator and turned into images. An auxiliary inner generator trainer restores images from inner generator features to provide direct feedback. clipping is being used to limit perturbations by loss spikes. However, it is not a direct replacement as ALRC preserves the distribution of backpropagated gradients whereas gradient clipping reduces large gradients. Instead, ALRC is designed to complement gradient clipping by limiting perturbations by large losses while gradient clipping modifies gradient distributions. The implementation of ALRC in algorithm 1 is for positive losses. This avoids the need to introduce small constants to prevent divide-by-zero errors. Nevertheless, ALRC can support negative losses by using standard methods to prevent divide-by-zero errors. Alternatively, a constant can be added to losses to make them positive without affecting learning. ALRC can also be extended to limit losses more than a number of standard deviations below their mean. This had no effect in our experiments. However, preemptively reducing loss spikes by clipping rewards between user-provided upper and lower bounds can improve reinforcement learning [26] . Subsequently, we suggest that clipping losses below their means did not improve learning because losses mainly spiked above their means; not below. Some partial STEM losses did spike below; however, they were mainly for blank or otherwise trivial completions. 8 We have developed ALRC to stabilize the training of ANNs by limiting backpropagated loss perturbations. Our experiments show that ALRC accelerates convergence and lowers losses for learning that would be destabilized by loss spikes and otherwise has little effect. Further, ALRC is computationally inexpensive, can be applied to any loss function or batch size, does not affect the distribution of backpropagated gradients and has a clear effect on learning. Overall, ALRC complements existing learning algorithms and can be routinely applied to arbitrary neural network training with gradient descent. 
The data that support the findings of this study are openly available. Source code based on TensorFlow [27] is provided for CIFAR-10 supersampling [28] and partial STEM [29] , and both CIFAR-10 [12] and STEM [19] datasets are available. For additional information contact the corresponding author (J M E ). ANN architecture for CIFAR-10 experiments is shown in figure 5 , and architecture for STEM partial scan and supersampling experiments is shown in figure 6 . The components in our networks are Bilinear Downsamp, wxw: This is an extension of linear interpolation in one dimension to two dimensions. It is used to downsample images to w × w. Bilinear Upsamp, xs: This is an extension of linear interpolation in one dimension to two dimensions. It is used to upsample images by a factor of s. Conv d, wxw, Stride, x: Convolutional layer with a square kernel of width, w, that outputs d feature channels. If the stride is specified, convolutions are only applied to every xth spatial element of their input, rather than to every element. Striding is not applied depthwise. + ⃝: Circled plus signs indicate residual connections [30] where tensors are added together. Residual connections help reduce signal attenuation and allow networks to learn perturbative transformations more easily. There are amendments or corrections to the paper 3 covered by this chapter. fig. 1 . Change: A title above the top two graphs is cut off. The missing title said "With Adaptive Learning Rate Clipping", and is visible in our preprint 16 . Location: Last paragraph starting on page 7. Change: "...inexpensive alternative to gradient clipping in high batch size training where..." should say "...inexpensive alternative to gradient clipping where...". This ancillary chapter covers my paper titled "Adaptive Learning Rate Clipping Stabilizes Learning" 3 and associated research outputs 16, 17 . The ALRC algorithm was developed to prevent loss spikes destabilizing training of DNNs for partial STEM 4 (ch. 4). To fit the partial STEM ANN in GPU memory, it was trained with a batch size of 1. However, using a small batch size results in occasional loss spikes, which meant that it was sometimes necessary to repeat training to compare performance with earlier experiments where learning had not been destabilized by loss spikes. I expected that I could adjust training hyperparameters to stabilize learning; however, I had optimized the hyperparameters and training was usually fine. Thus, I developed ALRC to prevent loss spikes from destabilizing learning. Initially, ALRC was included as an appendix in the first version of the partial STEM preprint 18 . However, ALRC was so effective that I continued to investigate. Eventually, there were too many ALRC experiments to comfortably fit in an appendix of the partial STEM paper, so I separated ALRC into its own paper. There are variety of alternatives to ALRC that can stabilize learning. A popular alternative is training with where L is a loss and λ is a training hyperparameter. However, I found that Huberized learning continued to be destabilized by loss spikes. I also considered gradient clipping [183] [184] [185] Aberration corrected scanning transmission electron microscopy (STEM) can achieve imaging resolutions below 0.1 nm, and locate atom columns with pm precision 1,2 . Nonetheless, the high current density of electron probes produces radiation damage in many materials, limiting the range and type of investigations that can be performed 3, 4 . 
A number of strategies to minimize beam damage have been proposed, including dose fractionation 5 and a variety of sparse data collection methods 6 . Perhaps the most intensively investigated approach to the latter is sampling a random subset of pixels, followed by reconstruction using an inpainting algorithm 3,6-10 . Poisson random sampling of pixels is optimal for reconstruction by compressed sensing algorithms 11 . However, random sampling exceeds the design parameters of standard electron beam deflection systems, and can only be performed by collecting data slowly 12, 13 , or with the addition of a fast deflection or blanking system 3, 14 . Sparse data collection methods that are more compatible with conventional beam deflection systems have also been investigated. For example, maintaining a linear fast scan deflection whilst using a widely-spaced slow scan axis with some small random 'jitter' 9,12 . However, even small jumps in electron beam position can lead to a significant difference between nominal and actual beam positions in a fast scan. Such jumps can be avoided by driving functions with continuous derivatives, such as those for spiral and Lissajous scan paths 3, 13, 15, 16 . Sang 13, 16 considered a variety of scans including Archimedes and Fermat spirals, and scans with constant angular or linear displacements, by driving electron beam deflectors with a field-programmable gate array (FPGA) based system. Spirals with constant angular velocity place the least demand on electron beam deflectors. However, dwell times, and therefore electron dose, decreases with radius. Conversely, spirals created with constant spatial speeds are prone to systematic image distortions due to lags in deflector responses. In practice, fixed doses are preferable as they simplify visual inspection and limit the dose dependence of STEM noise 17 . Deep learning has a history of successful applications to image infilling, including image completion 18 , irregular gap infilling 19 and supersampling 20 . This has motivated applications of deep learning to the completion of sparse, or 'partial' , scans, including supersampling of scanning electron microscopy 21 (SEM) and STEM images 22, 23 . Where pre-trained models are unavailable for transfer learning 24 , artificial neural networks (ANNs) are typically trained, validated and tested with large, carefully partitioned machine learning datasets 25, 26 so that they are robust to general use. In practice, this often requires at least a few thousand examples. Indeed, standard machine learning datasets such as CIFAR-10 27,28 , MNIST 29 , and ImageNet 30 contain tens of thousands or millions of examples. To train an ANN to complete STEM images from partial scans, an ideal dataset might consist of a large number of pairs of partial scans and corresponding high-quality, low noise images, taken with an aberration-corrected STEM. To our knowledge, such a dataset does not exist. As a result, we have collated a new dataset of STEM raster scans from which partial scans can be selected. Selecting partial scans from full scans is Examples of spiral and jittered gridlike partial scans investigated in this paper are shown in Fig. 1 . Continuous spiral scan paths that extend to image corners cannot be created by conventional scan systems without going over image edges. However, such a spiral can be cropped from a spiral with radius at least 2 −1/2 times the minimum image side, at the cost of increased scan time and electron beam damage to the surrounding material. 
We use Archimedes spirals, where θ ∝ r , and r and θ are polar radius and angle coordinates, as these spirals have the most uniform spatial coverage. Jittered gridlike scans would also be difficult to produce with a conventional system, which would suffer variations in dose and distortions due to limited beam deflector response. Nevertheless, these idealized scan paths serve as useful inputs to demonstrate the capabilities of our approach. We expect that other scan paths could be used with similar results. We fine-tune our ANNs as part of generative adversarial networks 31 (GANs) to complete realistic images from partial scans. A GAN consists of sets of generators and discriminators that play an adversarial game. Generators learn to produce outputs that look realistic to discriminators, while discriminators learn to distinguish between real and generated examples. Limitedly, discriminators only assess whether outputs look realistic; not if they are correct. This can result in a neural network only generating a subset of outputs, referred to as mode collapse 32 . To counter this issue, generator learning can be conditioned on an additional distance between generated and true images 33 . Meaningful distances can be hand-crafted or learned automatically by considering differences between features imagined by discriminators for real and generated images 34,35 . training In this section we introduce a new STEM images dataset for machine learning, describe how partial scans were selected from images in our data pipeline, and outline ANN architecture and learning policy. Detailed ANN architecture, learning policy, and experiments are provided as Supplementary Information, and source code is available 36 . Data pipeline. To create partial scan examples, we collated a new dataset containing 16227 32-bit floating point STEM images collected with a JEOL ARM200F atomic resolution electron microscope. Individual micrographs were saved to University of Warwick data servers by dozens of scientists working on hundreds of projects as Gatan Microscopy Suite 37 generated dm3 or dm4 files. As a result, our dataset has a diverse constitution. Atom columns are visible in two-thirds of STEM images, with most signals imaged at several times their Nyquist rates 38 , and similar proportions of images are bright and dark field. The other third of images are at magnifications too low for atomic resolution, or are of amorphous materials. Importantly, our dataset contains noisy images, incomplete scans and other low-quality images that would not normally be published. This ensures that ANNs trained on our dataset are robust to general use. The Digital Micrograph image format is rarely used outside the microscopy community. As a result, data has been transferred to the widely supported TIFF 39 file format in our publicly available dataset 40, 41 . Micrographs were split into 12170 training, 1622 validation, and 2435 test set examples. Each subset was collected by a different subset of scientists and has different characteristics. As a result, unseen validation and test sets can be used to quantify the ability of a trained network to generalize. To reduce data read times, each micrograph was split into non-overlapping 512 × 512 sub-images, referred to as 'crops' , producing 110933 training, 21259 validation and 28877 test set crops. For convenience, our crops dataset is also available 40, 41 . Each crop, I, was processed in our data pipeline by replacing non-finite electron counts, i.e. NaN and ±∞, with zeros. 
Crops were then linearly transformed to have intensities ∈ − I [ 1, 1] N , except for uniform crops satisfying Partial scans, I scan , were selected from raster scan crops, I N , by multiplication with a binary mask Φ path , scan path N where Φ = 1 path on a scan path, and Φ = 0 path otherwise. Raster scans are sampled at a rectangular lattice of discrete locations, so a subset of raster scan pixels are experimental measurements. In addition, although electron probe position error characteristics may differ for partial and raster scans, typical position errors are small 42, 43 . As a result, we expect that partial scans selected from raster scans with binary masks are realistic. We also selected partial scans with blurred masks to simulate varying dwell times and noise characteristics. These difficulties are encountered in incoherent STEM 44, 45 , where STEM illumination is detected by a transmission electron microscopy (TEM) camera. For simplicity, we created non-physical noise by multiplying I scan with where U is a uniform random variate distributed in [0, 2) . ANNs are able to generalize 46,47 , so we expect similar results for other noise characteristics. A binary mask, with values in {0, 1}, is a special case where no noise is applied i.e. η = (1) 1, and Φ = 0 path is not traversed. Performance is reported for both binary and blurred masks. The noise characteristics in our new STEM images dataset vary. This is problematic for mean squared error (MSE) based ANN training losses, as differences are higher for crops with higher noise. In effect, this would increase the importance of noisy images in the dataset, even if they are not more representative. Although adaptive ANN optimizers that divide parameter learning rates by gradient sizes 48 can partially mitigate weighting by varying noise levels, this restricts training to a batch size of 1 and limits momentum. Consequently, we low-passed filtered ground truth images, I N , to I blur by a 5 × 5 symmetric Gaussian kernel with a 2.5 px standard deviation, to calculate MSEs for ANN outputs. Network architecture. To generate realistic images, we developed a multiscale conditional GAN with TensorFlow 49 . Our network can be partitioned into the six convolutional 50,51 subnetworks shown in Fig. 2 : an inner generator, G inner , outer generator, G outer , inner generator trainer, T , and small, medium and large scale discriminators, D 1 , D 2 and D 3 . We refer to the compound network scan outer inner scan s can as the generator, and to D = {D 1 , D 2 , D 3 } as the multiscale discriminator. The generator is the only network needed for inference. Following recent work on high-resolution conditional GANs 34 , we use two generator subnetworks. The inner generator produces large scale features from partial scans bilinearly downsampled from 512 × 512 to 256 × 256. These features are then combined with inputs embedded by the outer generator to output full-size completions. Following Inception 52,53 , we introduce an auxiliary trainer network that cooperates with the inner generator to output 256 × 256 completions. This acts as a regularization mechanism, and provides a more direct path for www.nature.com/scientificreports www.nature.com/scientificreports/ gradients to backpropagate to the inner generator. To more efficiently utilize initial generator convolutions, partial scans selected with a binary mask are nearest neighbour infilled before being input to the generator. 
Multiscale discriminators examine real and generated STEM images to predict whether they are real or generated, adapting to the generator as it learns. Each discriminator assesses different-sized crops selected from 512 × 512 images, with sizes 70 × 70, 140 × 140 or 280 × 280. After selection, crops are bilinearly downsampled to 70 × 70 before discriminator convolutions. Typically, discriminators are applied at fractions of the full image size 34 e.g. 512/2 2 , 512/2 1 and 512/2 0 . However, we found that discriminators that downsample large fields of view to 70 × 70 are less sensitive to high-frequency STEM noise characteristics. Processing fixed size image regions with multiple discriminators has been proposed 54 to decrease computation for large images, and extended to multiple region sizes 34 . However, applying discriminators to arrays of non-overlapping image patches 55 results in periodic artefacts 34 that are often corrected by larger-scale discriminators. To avoid these artefacts and reduce computation, we apply discriminators to randomly selected regions at each spatial scale. Learning policy. Training has two halves. In the non-adversarial first half, the generator and auxiliary trainer cooperate to minimize mean squared errors (MSEs). This is followed by an optional second half of training, where the generator is fine-tuned as part of a GAN to produce realistic images. Our ANNs are trained by ADAM 56 optimized stochastic gradient descent 48,57 for up to 2 × 10 6 iterations, which takes a few days with an Nvidia GTX 1080 Ti GPU and an i7-6700 CPU. The objectives of each ANN are codified by their loss functions. In the non-adversarial first half of training, the generator, G, learns to minimize the MSE based loss , and adaptive learning rate clipping 58 (ALRC) is important to prevent high loss spikes from destabilizing learning. Experiments with and without ALRC are in Supplementary Information. To compensate for varying noise levels, ground truth images were blurred by a 5 × 5 symmetric Gaussian kernel with a 2.5 px standard deviation. In addition, the inner generator, G inner , cooperates with the auxiliary trainer, T , to minimize In the optional adversarial second half of training, we use = N 3 discriminator scales with numbers, N 1 , N 2 and N 3 , of discriminators, D 1 , D 2 and D 3 , respectively. There many popular GAN loss functions and regularization mechanisms 59, 60 . In this paper, we use spectral normalization 61 with squared difference losses 62 for the discriminators, where discriminators try to predict 1 for real images and 0 for generated images. We found that = = = N N N 1 1 2 3 is sufficient to train the generator to produce realistic images. However, higher performance might be achieved with more discriminators e.g. 2 large, 8 medium and 32 small discriminators. The generator learns to minimize the adversarial squared difference loss, by outputting completions that look realistic to discriminators. Discriminators only assess the realism of generated images; not if they are correct. To the lift degeneracy and prevent mode collapse, we condition adversarial training on non-adversarial losses. The total generator loss is G adv adv MSE a ux aux where we found that λ = 1 aux and λ = 5 adv is effective. We also tried conditioning the second half of training on differences between discriminator imagination 34, 35 . However, we found that MSE guidance converges to slightly lower MSEs and similar structural similarity indexes 63 for STEM images. 
To showcase ANN performance, example applications of adversarial and non-adversarial generators to 1/20 px coverage partial STEM completion are shown in Fig. 3 . Adversarial completions have more realistic high-frequency spatial information and structure, and are less blurry than non-adversarial completions. Systematic spatial variation is also less noticeable for adversarial completions. For example, higher detail along spiral paths, where errors are lower, can be seen in the bottom two rows of Fig. 3 for non-adversarial completions. Inference only requires a generator, so inference times are the same for adversarial and non-adversarial completions. Single image inference time during training is 45 ms with an Nvidia GTX 1080 Ti GPU, which is fast enough for live partial scan completion. In practice, 1/20 px scan coverage is sufficient to complete most spiral scans. However, generators cannot reliably complete micrographs with unpredictable structure in regions where there is no coverage. This is demonstrated by example applications of non-adversarial generators to 1/20 px coverage spiral and gridlike partial scans www.nature.com/scientificreports www.nature.com/scientificreports/ in Fig. 4 . Most noticeably, a generator invents a missing atom at a gap in gridlike scan coverage. Spiral scans have lower errors than gridlike scans as spirals have smaller gaps between coverage. Additional sheets of examples for spiral scans selected with binary masks are provided for scan coverages between 1/17.9 px and 1/87.0 px as Supplementary Information. To characterize generator performance, MSEs for output pixels are shown in Fig. 5 . Errors were calculated for 20000 test set 1/20 px coverage spiral scans selected with blurred masks. Errors systematically increase with increasing distance from paths for non-adversarial training, and are less structured for adversarial training. Similar to other generators 23, 64 , errors are also higher near the edges of non-adversarial outputs where there is less information. We tried various approaches to decrease non-adversarial systematic error variation by modifying loss functions. For examples: by ALRC; multiplying pixel losses by their running means; by ALRC and www.nature.com/scientificreports www.nature.com/scientificreports/ multiplying pixel losses by their running means; and by ALRC and multiplying pixel losses by final mean losses of a trained network. However, we found that systematic errors are similar for all variants. This is a limitation of partial STEM as information decreases with increasing distance from scan paths. Adversarial completions also exhibit systematic errors that vary with distance from spiral paths. However, spiral variation is dominated by other, less structured, spatial error variation. Errors are higher for adversarial training than for non-adversarial training as GANs complete images with realistic noise characteristics. Spiral path test set intensity errors are shown in Fig. 6a , and decrease with increasing coverage for binary masks. Test set errors are also presented for deep learning supersampling 23 (DLSS) as they are the only results that are directly comparable. DLSS is an alternative approach to compressed sensing where STEM images are completed from a sublattice of probing locations. Both DLSS and partial STEM results are for the same neural network architecture, learning policy and training dataset. Results depend on datasets, so using the same dataset is essential for quantitative comparison. 
We find that DLSS errors are lower than spiral errors at all coverages. In addition, spiral errors exponentially increase above DLSS errors at low coverages where minimum distances from spiral paths increase. Although this comparison may appear unfavourable for partial STEM, we expect that this is a limitation of training signals being imaged at several times their Nyquist rates. Distributions of 20000 spiral path test set root mean squared (RMS) intensity errors for spiral data in Fig. 6a are shown in Fig. 6b . The coverages listed in Fig. 6 are for infinite spiral paths with 1/16, 1/25, 1/36, 1/49, 1/64, 1/81, and 1/100 px coverage after paths are cut by image boundaries; changing coverage. All distributions have a similar peak near an RMS error of 0.04, suggesting that generator performance remains similar for a portion of images as coverage is varied. As coverage decreases, the portion of errors above the peak increases as generators have difficulty with more images. In addition, there is a small peak close to zero for blank or otherwise trivial completions. Partial STEM can decrease scan coverage and total electron electron dose by 10-100× with 3-6% test set RMS errors. These errors are small compared to typical STEM noise. Decreased electron dose will enable new STEM applications to beam-sensitive materials, including organic crystals 65 , metal-organic frameworks 66 , nanotubes 67 , and nanoparticle dispersions 68 . Partial STEM can also decrease scan times in proportion to decreased coverage. This will enable increased temporal resolution of dynamic materials, including polar nanoregions in relaxor ferroelectrics 69, 70 , atom motion 71 , nanoparticle nucleation 72 , and material interface dynamics 73 . In addition, faster scans can reduce delay for experimenters, decreasing microscope time. Partial STEM can also be a starting point for algorithms that process STEM images e.g. to find and interpret atomic positions 74 . www.nature.com/scientificreports www.nature.com/scientificreports/ Our generators are trained for fixed coverages and 512 × 512 inputs. However, recent research has introduced loss function modifications that can be used to train a single generator for multiple coverages with minimal performance loss 23 . Using a single GAN improves portability as each of our GANs requires 1.3 GB of storage space with 32 bit model parameters, and limits technical debt that may accompany a large number of models. Although our generator input sizes are fixed, they can be tiled across larger images; potentially processing tiles in a single batch for computational efficiency. To reduce higher errors at the edge of generator outputs, tiles can be overlapped so that edges may be discarded 64 . Smaller images could be padded. Alternatively, dedicated generators can be trained for other output sizes. There is an effectively infinite number of possible partial scan paths for 512 × 512 STEM images. In this paper, we focus on spiral and gridlike partial scans. For a fixed coverage, we find that the most effective method to decrease errors is to minimize maximum distances from input information. The less information there is about an output region, the more information that needs to be extrapolated, and the higher the error. For example, we find that errors are lower for spiral scans than gridlike scans as maximum distances from input information are lower. Really, the optimal scan shape is not static: It is specific to a given image and generator architecture. 
As a result, we are actively developing an intelligent partial scan system that adapts to inputs as they are scanned. Partial STEM has a number of limitations relative to DLSS. For a start, partial STEM may require a custom scan system. Even if a scan system supports or can be reprogrammed to support custom scan paths, it may be insufficiently responsive. In contrast, DLSS can be applied as a postprocessing step without hardware modification. Another limitation of partial STEM is that errors increase with increasing distance from scan paths. Distances from continuous scan paths cannot be decreased without increasing coverage. Finally, most features in our new STEM crops dataset are sampled at several times their Nyquist rates. Electron microscopists often record images above minimum sufficient resolutions and intensities to ease visual inspection and limit the effects of drift 75 , shot 17 , and other noise. This means that a DLSS lattice can still access most high frequency information in our dataset. Test set DLSS errors are lower than partial STEM errors for the same architecture and learning policy. However, this is not conclusive as generators were trained for a few days; rather than until validation errors diverged from training errors. For example, we expect that spirals need more training iterations than DLSS as www.nature.com/scientificreports www.nature.com/scientificreports/ nearest neighbour infilled spiral regions have varying shapes, whereas infilled regions of DLSS grids are square. In addition, limited high frequency information in training data limits one of the key strengths of partial STEM that DLSS lacks: access to high-frequency information from neighbouring pixels. As a result, we expect that partial STEM performance would be higher for signals imaged closer to their Nyquist rates. To generate realistic images, we fine-tuned partial STEM generators as part of GANs. GANs generate images with more realistic high-frequency spatial components and structure than MSE training. However, GANs focus on semantics; rather than intensity differences. This means that although adversarial completions have realistic characteristics, such as high-frequency noise, individual pixel values differ from true values. GANs can also be difficult to train 76, 77 , and training requires additional computation. Nevertheless, inference time is the same for adversarial and non-adversarial generators after training. Encouragingly, ANNs are universal approximators 78 that can represent 79 the optimal mapping from partial scans with arbitrary accuracy. This overcomes the limitations of traditional algorithms where performance is fixed. If ANN performance is insufficient or surpassed by another method, training or development can be continued to achieve higher performance. Indeed, validation errors did not diverge from training errors during our experiments, so we are presenting lower bounds for performance. In this paper, we compare spiral STEM performance against DLSS. It is the only method that we can rigorously and quantitatively compare against as it used the same test set data. This yielded a new insight into how signals being imaged above their Nyquist rates may affect performance discussed two paragraphs earlier, and highlights the importance of standardized datasets like our new STEM images dataset. As machine learning becomes more established in the electron microscopy community, we hope that standardized datasets will also become established to standardize performance benchmarks. 
Detailed conclusions Partial STEM with deep learning can decrease electron dose and scan time by over an order of magnitude with minimal information loss. In addition, realistic STEM images can be completed by fine-tuning generators as part of a GAN. Detailed MSE characteristics are provided for multiple coverages, including MSEs per output pixel for 1/20 px coverage spiral scans. Partial STEM will enable new beam sensitive applications, so we have made our source code, new STEM dataset, pre-trained models, and details of experiments available to encourage further investigation. High performance is achieved by the introduction of an auxiliary trainer network, and adaptive learning rate clipping of high losses. We expect our results to be generalizable to SEM and other scan systems. New STEM datasets are available on our publicly accessible dataserver 40, 41 . Source code for ANNs and to create images is in a GitHub repository with links to pre-trained models 36 Discriminator architecture is shown in Fig. S1 . Generator and inner generator trainer architecture is shown in Fig. S2 . The components in our networks are Bilinear Downsamp, wxw: This is an extension of linear interpolation in one dimension to two dimensions. It is used to downsample images to w×w. Bilinear Upsamp, xs: This is an extension of linear interpolation in one dimension to two dimensions. It is used to upsample images by a factor of s. Conv d, wxw, Stride, x: Convolutional layer with a square kernel of width, w, that outputs d feature channels. If the stride is specified, convolutions are only applied to every xth spatial element of their input, rather than to every element. Striding is not applied depthwise. Linear, d: Flatten input and fully connect it to d feature channels. Random Crop, wxw: Randomly sample a w×w spatial location using an external probability distribution. + : Circled plus signs indicate residual connections where incoming tensors are added together. These help reduce signal attenuation and allow the network to learn perturbative transformations more easily. All generator convolutions are followed by running mean-only batch normalization then ReLU activation, except output convolutions. All discriminator convolutions are followed by slope 0.2 leaky ReLU activation. Figure S2 . Two-stage generator that completes 512×512 micrographs from partial scans. A dashed line indicates that the same image is input to the inner and outer generator. Large scale features developed by the inner generator are locally enhanced by the outer generator and turned into images. An auxiliary trainer network restores images from inner generator features to provide direct feedback. This figure was created with Inkscape 1 . Optimizer: Training is ADAM 2 optimized and has two halves. In the first half, the generator and auxiliary trainer learn to minimize mean squared errors between their outputs and ground truth images. For the quarter of iterations, we use a constant learning rate η 0 = 0.0003 and a decay rate for the first moment of the momentum β 1 = 0.9. The learning rate is then stepwise decayed to zero in eight steps over the second quarter of iterations. Similarly, β 1 is stepwise linearly decayed to 0.5 in eight steps. In an optional second half, the generator and discriminators play an adversarial game conditioned on MSE guidance. For the third quarter of iterations, we use η = 0.0001 and β 1 = 0.9 for the generator and discriminators. 
In the final quarter of iterations, the generator learning rate is decayed to zero in eight steps while the discriminator learning rate remains constant. Similarly, generator and discriminator β 1 is stepwise decayed to 0.5 in eight steps. Experiments with GAN training hyperparameters show that β 1 = 0.5 is a good choice 3 . Our decision to start at β 1 = 0.9 aims to improve the initial rate of convergence. In the first stage, generator and auxiliary trainer parameters are both updated once per training step. In the second stage, all parameters are updated once per training step. In most of our initial experiments with burred masks, we used a total of 10 6 training iterations. However, we found that validation errors do not diverge if training time is increased to 2 × 10 6 iterations, and used this number for experiments with binary masks. These training iterations are in-line with with other GANs, which reuse datasets containing a few thousand examples for 200 epochs 4 . The lack of validation divergence suggests that performance may be substantially improved, and means that our results present lower bounds for performance. All training was performed with a batch size of 1 due to the large model size needed to complete 512×512 scans. Adaptive learning rate clipping: To stabilize batch size 1 training, adaptive learning rate clipping 5 (ALRC) was developed to limit high MSEs. ALRC layers were initialized with first raw moment µ 1 = 25, second raw moment µ 2 = 30, exponential decay rates β 1 = β 2 = 0.999, and n = 3 standard deviations. Input normalization: Partial scans, I scan , input to the generator are linearly transformed to I scan = (I scan + 1)/2, where I scan ∈ [0, 1]. The generator is trained to output ground truth crops in [0, 1] , which are linearly transformed to [−1, 1]. Generator outputs and ground truth crops in [−1, 1] are directly input to discriminators. Weight normalization: All generator parameters are weight normalized 6 . Running mean-only batch normalization 6, 7 is applied to the output channels of every convolutional layer, except the last. Channel means are tracked by exponential moving averages with decay rates of 0.99. Running mean-only batch normalization is frozen in the second half of training to improve stability 8 . Spectral normalization: Spectral normalization 3 is applied to the weights of each convolutional layer in the discriminators to limit the Lipschitz norms of the discriminators. We use the power iteration method with one iteration per training step to enforce a spectral norm of 1 for each weight matrix. Spectral normalization stabilizes training, reduces susceptibility to mode collapse and is independent of rank, encouraging discriminators to use more input features to inform decisions 3 . In contrast, weight normalization 6 and Wasserstein weight clipping 9 impose more arbitrary model distributions that may only partially match the target distribution. Activation: In the generator, ReLU 10 non-linearities are applied after running mean-only batch normalization. In the discriminators, slope 0.2 leaky ReLU 11 non-linearities are applied after every convolutional layer. Rectifier leakage encourages discriminators to use more features to inform decisions. Our choice of generator and discriminator non-linearities follows recent work on high-resolution conditional GANs 4 . Initialization: Generator weights were initialized from a normal distribution with mean 0.00 and standard deviation 0.05. 
To apply weight normalization, an example scan is then propagated through the network. Each layer output is divided by its L2 norm and the layer weights assigned their division by the square root of the L2 normalized output's standard deviation. There are no biases in the generator as running mean-only batch normalization would allow biases to grow unbounded c.f. batch normalization 12 . Discriminator weights were initialized from a normal distribution with mean 0.00 and standard deviation 0.03. Discriminator biases were zero initialized. Experience replay: To reduce destabilizing discriminator oscillations 13 , we used an experience replay 14 In this section, we present learning curves for some of our non-adversarial architecture and learning policy experiments. During training, each training set example was reused ∼8 times. In comparison, some generative adversarial networks (GANs) are trained on the same data hundreds of times 4 . As a result, we did not experience noticeable overfitting. In cases where final S3/S16 errors are similar; so that their difference is not significant within the error of a single experiment, we choose the lowest error approach. In practice, choices between similar errors are unlikely to have a substantial effect on performance. Each experiment took a few days with an Nvidia GTX 1080 Ti GPU. All learning curves are 2500 iteration boxcar averaged. In addition, the first 10 4 iterations before dashed lines in figures, where losses rapidly decrease, are not shown. Following previous work on high-resolution GANs 4 , we used a multi-stage training protocol for our initial experiments. The outer generator was trained separately; after the inner generator, before fine-tuning the inner and outer generator together. An alternative approach uses an auxiliary loss network for end-to-end training, similar to Inception 17, 18 . This can provide a more direct path for gradients to back-propagate to the start of the network and introduces an additional regularization mechanism. Experimenting, we connected an auxiliary trainer to the inner generator and trained the network in a single stage. As shown by Fig. S3a , auxiliary network supported end-to-end training is more stable and converges to lower errors. In encoder-decoders, residual connections 19 between strided convolutions and symmetric strided transpositional convolutions can be used to reduce information loss. This is common in noise removal networks where the output is similar to the input 20, 21 . However, symmetric residual connections are also used in encoder-decoder networks for semantic image segmentation 22 where the input and output are different. Consequently, we tried adding symmetric residual connections between strided and transpositional inner generator convolutions. As shown by Fig. S3b , extra residuals accelerate initial inner generator training. However, final errors are slightly higher and initial inner generator training converged to similar errors with and without symmetric residuals. Taken together, this suggests that symmetric residuals initially accelerate training by enabling the final inner generator layers to generate crude outputs though their direct connections to the first inner generator layers. However, the symmetric connections also provide a direct path for low-information outputs of the first layers to get to the final layers, obscuring the contribution of the inner generator's skip-3 residual blocks (section S1) and lowering performance in the final stages of training. 
Path information is concatenated to the partial scan input to the generator. In principle, the generator can infer electron beam paths from partial scans. However, the input signal is attenuated as it travels through the network 23 . In addition, path information would have to be deduced; rather than informing calculations in the first inner generator layers, decreasing efficiency. To compensate, paths used to generate partial scans from full scans are concatenated to inputs. As shown by Fig. S3b , concatenating path information reduces errors throughout training. Performance might be further improved by explicitly building sparsity into the network 24 . Large convolutional kernels are often used at the start of neural networks to increase their receptive field. This allows their first convolutions to be used more efficiently. The receptive field can also be increased by increasing network depth, which could also enable more efficient representation of some functions 25 . However, increasing network depth can also increase information loss 23 and representation efficiency may not be limiting. As shown by Fig. S3c , errors are lower for small first convolution kernels; 3×3 for the inner generator and 7×7 for the outer generator or both 3×3, than for large first convolution kernels; 7×7 for the inner generator and 17×17 for the outer generator. This suggests that the generator does not make effective use of the larger 17×17 kernel receptive field and that the variability of the extra kernel parameters harms learning. Learning curves for different learning rate schedules are shown in Fig. S3d . Increasing training iterations and doubling the learning rate from 0.0002 to 0.0004 lowers errors. Validation errors do not plateau for 10 6 iterations in Fig. S3e , suggesting that continued training would improve performance. In our experiments, validation errors were calculated after every 50 training iterations. The choice of output domain can affect performance. Training with a [0, 1] output domain is compared against [−1, 1] for slope 0.01 leaky ReLU activation after every generator convolution in Fig. S3f . Although [−1, 1] is supported by leaky ReLUs, requiring orders of magnitude differences in scale for [−1, 0) and (0, 1] hinders learning. To decrease dependence on the choice output domain, we do not apply batch normalization or activation after the last generator convolutions in our final architecture. The [0, 1] outputs of Fig. S3f were linearly transformed to [−1, 1] and passed through a tanh non-linearity. This ensured that [0, 1] output errors were on the same scale as [−1, 1] output errors, maintaining the same effective learning rate. Initially, outputs were clipped by a tanh non-linearity to limit outputs far from the target domain from perturbing training. However, Fig. S4a shows that errors are similar without end non-linearites so they were removed. Fig. S4a also shows that replacing slope 0.01 leaky ReLUs with ReLUs and changing all kernel sizes to 3×3 has little effect. Swapping to ReLUs and 3×3 kernels is therefore an option to reduce computation. Nevertheless, we continue to use larger kernels throughout as we think they would usefully increase the receptive field with more stable, larger batch size training. To more efficiently use the first generator convolutions, we nearest neighbour infilled partial scans. As shown by Fig. S4b , infilling reduces error. 
However, infilling is expected to be of limited use for low-dose applications as scans can be noisy, making meaningful infilling difficult. Nevertheless, nearest neighbour partial scan infilling is a computationally inexpensive method to improve generator performance for high-dose applications. To investigate our generator's ability to handle STEM noise 26 , we combined uniform noise with partial scans of Gaussian blurred STEM images. More noise was added to low intensity path segments and low-intensity pixels. As shown by Fig. S4c, S4 /S16 Figure S3 . Learning curves. a) Training with an auxiliary inner generator trainer stabilizes training, and converges to lower than two-stage training with fine tuning. b) Concatenating beam path information to inputs decreases losses. Adding symmetric residual connections between strided inner generator convolutions and transpositional convolutions increases losses. c) Increasing sizes of the first inner and outer generator convolutional kernels does not decrease losses. d) Losses are lower after more interations, and a learning rate (LR) of 0.0004; rather than 0.0002. Labels indicate inner generator iterations -outer generator iterations -fine tuning iterations, and k denotes multiplication by 1000 e) Adaptive learning rate clipped quartic validation losses have not diverged from training losses after 10 6 iterations. f) Losses are lower for outputs in [0, 1] than for outputs in [-1, 1] if leaky ReLU activation is applied to generator outputs. S5/S16 Figure S4 . Learning curves. a) Making all convolutional kernels 3×3, and not applying leaky ReLU activation to generator outputs does not increase losses. b) Nearest neighbour infilling decreases losses. Noise was not added to low duration path segments for this experiment. c) Losses are similar whether or not extra noise is added to low-duration path segments. d) Learning is more stable and converges to lower errors at lower learning rates (LRs). Losses are lower for spirals than grid-like paths, and lowest when no noise is added to low-intensity path segments. e) Adaptive momentum-based optimizers, ADAM and RMSProp, outperform non-adaptive momentum optimizers, including Nesterov-accelerated momentum. ADAM outperforms RMSProp; however, training hyperparameters and learning protocols were tuned for ADAM. Momentum values were 0.9. f) Increasing partial scan pixel coverages listed in the legend decreases losses. S6/S16 Figure S5 . Adaptive learning rate clipping stabilizes learning, accelerates convergence and results in lower errors than Huberisation. Weighting pixel errors with their running or final mean errors is ineffective. ablating extra noise for low-duration path segments increases performance. Fig. S4d shows that spiral path training is more stable and reaches lower errors at lower learning rates. At the same learning rate, spiral paths converge to lower errors than grid-like paths as spirals have more uniform coverage. Errors are much lower for spiral paths when both intensity-and duration-dependent noise is ablated. To choose a training optimizer, we completed training with stochastic gradient descent, momentum, Nesterov-accelerated momentum 27, 28 , RMSProp 29 and ADAM 2 . Learning curves are in Fig. S4e . Adaptive momentum optimizers, ADAM and RMSProp, outperform the non-adaptive optimizers. Non-adaptive momentum-based optimizers outperform momentumless stochastic gradient decent. ADAM slightly outperforms RMSProp; however, architecture and learning policy were tuned for ADAM. 
This suggests that RMSProp optimization may also be a good choice. Learning curves for 1/10, 1/20, 1/40 and 1/100 px coverage spiral scans are shown in Fig. S4f . In practice, 1/20 px coverage is sufficient for most STEM images. On average, a non-adversarial generator can complete test set 1/20 px coverage partial scans with a 2.6% root mean squared intensity error. Nevertheless, higher coverage is needed to resolve fine detail in some images. Likewise, lower coverage may be appropriate for images without fine detail. Consequently, we are developing an intelligent scan system that adjusts coverage based on micrograph content. Training is performed with a batch size of 1 due to the large network size needed for 512×512 partial scans. However, MSE training is unstable and large error spikes destabilize training. To stabilize learning, we developed adaptive learning rate clipping 5 (ALRC) to limit magnitudes of high losses while preserving their initial gradient distributions. ALRC is compared against MSE, Huberised MSE, and weighting each pixel's error by its Huberised running mean, and fixed final errors in Fig. S5 . ALRC results in more stable training with the fastest convergence and lowest errors. Similar improvements have been confirmed for CIFAR-10 and STEM supersampling with ALRC 5 . Sheets of examples comparing non-adversarial generator outputs and true images are shown in Fig. S6 -S12 for 512×512 spiral scans selected with binary masks. True images are blurred by a 5×5 symmetric Gaussian kernel with a 2.5 px standard deviation so that they are the same as the images that generators were trained output. Images are blurred to suppress high-frequency noise. Examples are presented for 1/17.9, 1/27.3, 1/38.2, 1/50.0, 1/60.5, 1/73.7, and 1/87.0 px coverage, in that order, so that higher errors become apparent for decreasing coverage with increasing page number. Quantitative performance characteristics for each generator are provided in the main article. 2XWSXW %OXUUHG7UXWK Figure S6 . Non-adversarial 512×512 outputs and blurred true images for 1/17.9 px coverage spiral scans selected with binary masks. 2XWSXW %OXUUHG7UXWK 2XWSXW %OXUUHG7UXWK Figure S7 . Non-adversarial 512×512 outputs and blurred true images for 1/27.3 px coverage spiral scans selected with binary masks. 2XWSXW %OXUUHG7UXWK 2XWSXW %OXUUHG7UXWK Figure S8 . Non-adversarial 512×512 outputs and blurred true images for 1/38.2 px coverage spiral scans selected with binary masks. S10/S16 2XWSXW %OXUUHG7UXWK 2XWSXW %OXUUHG7UXWK Figure S9 . Non-adversarial 512×512 outputs and blurred true images for 1/50.0 px coverage spiral scans selected with binary masks. S11/S16 2XWSXW %OXUUHG7UXWK 2XWSXW %OXUUHG7UXWK Figure S10 . Non-adversarial 512×512 outputs and blurred true images for 1/60.5 px coverage spiral scans selected with binary masks. S12/S16 2XWSXW %OXUUHG7UXWK 2XWSXW %OXUUHG7UXWK Figure S11 . Non-adversarial 512×512 outputs and blurred true images for 1/73.7 px coverage spiral scans selected with binary masks. S13/S16 2XWSXW %OXUUHG7UXWK 2XWSXW %OXUUHG7UXWK Figure S12 . Non-adversarial 512×512 outputs and blurred true images for 1/87.0 px coverage spiral scans selected with binary masks. S14/S16 There are amendments or corrections to the paper 4 covered by this chapter. Location: Reference 13 in the bibliography. 
This chapter covers our paper titled "Partial Scanning Transmission Electron Microscopy with Deep Learning" 4 and associated research outputs 10, 15, [18] [19] [20] [21] 188 , which were summarized by Bethany Connolly 189 A third investigation into compressed sensing with a fixed random grid of probing locations was not published as I think that uniformly spaced grid scans are easier to implement on most scan systems. Further, reconstruction errors were usually similar for uniformly spaced and fixed random grids with the same coverage. Nevertheless, a paper I drafted on fixed random grids is openly accessible 190 . Overall, I think that compressed sensing with DNNs is a promising approach to reduce electron beam damage and scan time by 10-100× with minimal information loss. My comparison of spiral and uniformly spaced grid scans with the same ANN architecture, learning policy and training data indicates that errors are lower for uniformly spaced grids. However, the comparison is not conclusive as ANNs were trained for a few days, rather than until validation errors plateaued. Further, a fair comparison is difficult as suitability of architectures and learning policies may vary for different scan paths. Higher performance of uniformly spaced grids can be explained by content at the focus of most electron micrographs being imaged at 5-10× its Nyquist rate 2 (ch. 2). It follows that high-frequency information that is accessible from neighbouring pixels in contiguous scans is often almost redundant. Overall, I think the best approach may combine both contiguous and uniform spaced grid scans. For example, a contiguous scan ANN could exploit high-frequency information to complete an image, which could then be mapped to a higher resolution image by an ANN for uniformly spaced scans. Indeed, functionality for contiguous and uniformly spaced grid scans could be combined into a single ANN. Most STEM scan systems can raster uniformly spaced grids of probing locations. However, scan systems often have to be modified to perform spiral or other custom scans 191, 192 Most scan systems sample signals at sequences of discrete probing locations. Examples include atomic force microscopy 1, 2 , computerized axial tomography 3, 4 , electron backscatter diffraction 5 , scanning electron microscopy 6 , scanning Raman spectroscopy 7 , scanning transmission electron microscopy 8 (STEM) and X-ray diffraction spectroscopy 9 . In STEM, the high current density of electron probes produces radiation damage in many materials, limiting the range and types of investigations that can be performed 10, 11 . In addition, most STEM signals are oversampled 12 to ease visual inspection and decrease sub-Nyquist artefacts 13 . As a result, a variety of compressed sensing 14 algorithms have been developed to enable decreased STEM probing 15 . In this paper, we introduce a new approach to STEM compressed sensing where a scan system is trained to piecewise adapt partial scans 16 to specimens by deep reinforcement learning 17 (RL). Established compressed sensing strategies include random sampling [18] [19] [20] , uniformly spaced sampling 19, [21] [22] [23] , sampling based on a model of a sample 24, 25 , partials scans with fixed paths 16 , dynamic sampling to minimize entropy [26] [27] [28] [29] and dynamic sampling based on supervised learning 30 . Complete signals can be extrapolated from partial scans by an infilling algorithm, estimating their fast Fourier transforms 31 or inferred by an artificial neural network 16, 23 (ANN). 
In general, the best sampling strategy varies for different specimens. For example, uniformly spaced sampling is often better than spiral paths for oversampled STEM images 16 . However, sampling strategies designed by humans usually have limited ability to leverage an understanding of physics to optimize sampling. As proposed by our earlier work 16 , we have therefore developed ANNs to dynamically adapt scan paths to specimens. Expected performance of dynamic scans can always match or surpass expected performance of static scans as static scan paths are a special case of dynamic scan paths. Exploration of STEM specimens is a finite-horizon partially observed Markov decision process 32, 33 (POMDP) with sparse losses: A partial scan can be constructed from path segments sampled at each step of the POMDP and a loss can be based on the quality of an scan completion generated from the partial scan with an ANN. Most scan systems support custom scan paths or can be augmented with a field programmable gate array 34, 35 (FPGA) to support custom scan paths. However, there is a delay before a scan system can execute or is ready to receive a new command. Total latency can be reduced by using both fewer and larger steps, and decreasing steps may also reduce distortions due to cumulative errors in probing positions 34 after commands are executed. Command execution can also be delayed by ANN inference. However, inference delay can be minimized by using a computationally lightweight ANN and inferring future commands while previous commands are executing. Markov decision processes (MDPs) can be optimized by recurrent neural networks (RNNs) based on long short-term memory 36, 37 (LSTM), gated recurrent unit 38 (GRU), or other cells [39] [40] [41] . LSTMs and GRUs are popular as they solve the vanishing gradient problem 42 and have consistently high performance 40 are often applied to MDPs as they can learn to extract and remember state information to inform future decisions. To solve dynamic graphs, an RNN can be augmented with dynamic external memory to create a differentiable neural computer 43 (DNC). To optimize a MDP, a discounted future loss, Q t , at step t in a MDP with T steps can be calculated from step losses, L t , with Bellman's equation, where γ ∈ [0, 1) discounts future step losses. Equations for RL are often presented in terms of rewards, e.g. r t = −L t ; however, losses are an equivalent representation that avoids complicating our equations with minus signs. Discounted future loss backpropagation through time 44 48 , and playing score-based computer games 49, 50 . Actors can be trained with non-differentiable losses by introducing a differentiable surrogate 51 or critic 52 to predict losses that can be backpropagated to actor parameters. Alternatively, non-differentiable losses can be backpropagated to agent parameters if actions are sampled from a differentiable probability distribution 46, 53 as training losses given by products of losses and sampling probabilities are differentiable. There are also a variety of alternatives to gradient descent, such as simulated annealing 54 and evolutionary algorithms 55 , that do not require differentiable loss functions. Such alternatives can outperform gradient descent 56 ; however, they usually achieve similar or lower performance than gradient descent for deep ANN training. In this section, we outline our training environment, ANN architecture and learning policy. Our ANNs were developed in Python with TensorFlow 57 . 
Detailed architecture and learning policy are in supplementary information. In addition, source code and pretrained models are openly accessible from GitHub 58 , and training data is openly accessible 12, 59 . To create partial scans from STEM images, an actor, µ, infers action unit vectors, µ(h_t), based on a history, h_t = (o_1, a_1, ..., o_t, a_t), of previous actions, a, and observations, o. To encourage exploration, µ(h_t) is rotated to a_t by Ornstein-Uhlenbeck 60 (OU) exploration noise 61 , ε_t, updated as ε_{t+1} = ε_t + θ(ε_avg − ε_t) + σW, where we chose θ = 0.1 to decay noise to ε_avg = 0, a scale factor, σ = 0.2, to scale a standard normal variate, W, and start noise ε_0 = 0. OU noise is linearly decayed to zero throughout training. Correlated OU exploration noise is recommended for continuous control tasks optimized by deep deterministic policy gradients 49 (DDPG) and recurrent deterministic policy gradients 50 (RDPG). Nevertheless, follow-up experiments with twin delayed deep deterministic policy gradients 62 (TD3) and distributed distributional deep deterministic policy gradients 63 (D4PG) have found that uncorrelated Gaussian noise can produce similar results.

An action, a_t, is the direction to move to observe a path segment, o_t, from the position at the end of the previous path segment. Partial scans are constructed from complete histories of actions and observations, h_T. A simplified partial scan is shown in figure 1. In our experiments, partial scans, s, are constructed from T = 20 straight path segments selected from 96×96 STEM images. Each segment has 20 probing positions separated by d = √2 px, and positions can be outside an image. The pixels in the image nearest each probing position are sampled, so a separation of d ≥ √2 px simplified development by preventing successive probing positions in a segment from sampling the same pixel. A separation of d < √2 px would allow a pixel to be sampled more than once by moving diagonally, potentially incentivising orthogonal scan motion to sample more pixels. Following our earlier work 16, 23, 64 , we select subsets of pixels from STEM images to create partial scans to train ANNs for compressed sensing. Selecting a subset of pixels is easier than preparing a large, carefully partitioned and representative dataset 65, 66 containing experimental partial scan and full image pairs, and selected pixels have realistic noise characteristics as they are from experimental images. However, selecting a subset of pixels does not account for probing location errors varying with scan shape 34 . We use a Warwick Electron Microscopy Dataset (WEMD) containing 19769 32-bit 96×96 images cropped and downsampled from full images 12, 59 . Cropped images were blurred by a symmetric 5×5 Gaussian kernel with a 2.5 px standard deviation to decrease any training loss variation due to varying noise characteristics. Finally, images, I, were linearly transformed to normalized images, I_N, with minimum and maximum values of −1 and 1. To test performance, the 19769 images were split, without shuffling, into a training set containing 15815 images and a test set containing 3954 images. For training, our adaptive scan system consists of an actor, µ, target actor, µ′, critic, Q, target critic, Q′, and generator, G. To minimize latency, our actors and critics are computationally inexpensive deep LSTMs 67 with a depth of 2 and 256 hidden units. Our generator is a convolutional neural network 68, 69 (CNN).
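The following is a minimal sketch of the Ornstein-Uhlenbeck exploration noise and the rotation of action unit vectors described above, using the quoted hyperparameters (θ = 0.1, σ = 0.2, ε_avg = 0, ε_0 = 0). Interpreting the noise as a rotation angle in radians and the linear decay factor are assumptions; this is not the released code.

```python
import numpy as np

class OUNoise:
    """Temporally correlated Ornstein-Uhlenbeck exploration noise."""
    def __init__(self, theta=0.1, sigma=0.2, eps_avg=0.0, eps0=0.0):
        self.theta, self.sigma, self.eps_avg = theta, sigma, eps_avg
        self.eps = eps0

    def sample(self, decay=1.0):
        # `decay` is linearly annealed from 1 to 0 across training.
        w = np.random.randn()
        self.eps += self.theta * (self.eps_avg - self.eps) + self.sigma * w
        return decay * self.eps

def rotate(unit_vector, angle):
    """Rotate a 2D unit action vector by an exploration angle (radians)."""
    c, s = np.cos(angle), np.sin(angle)
    x, y = unit_vector
    return np.array([c * x - s * y, s * x + c * y])
```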
A recurrent actor selects actions, a_t, and observes path segments, o_t, that are added to an experience replay 70 , R, containing 10⁵ sequences of actions and observations, h_T = (o_1, a_1, ..., o_T, a_T). Partial scans, s, are constructed from histories sampled from the replay to train a generator to complete partial scans, I_G^i = G(s^i). The actor and generator cooperate to minimize generator losses, L_G, and are the only networks needed for inference. Generator losses are not differentiable w.r.t. actor actions used to construct partial scans, i.e. ∂L_G/∂a_t = 0. Following RDPG 50 , we therefore introduce recurrent critics to predict losses from actor actions and observations that can be backpropagated to actors for training by BPTT. Actor and critic RNNs have the same architecture, except actors have two outputs to parameterize actions whereas critics have one output to predict losses. Target networks 49, 71 use exponential moving averages of live actor and critic network parameters and are introduced to stabilize learning. For training by RDPG, live and target ANNs separately replay experiences. However, we propagate live RNN states to target RNNs at each step as a precaution against any cumulative divergence of target network behaviour from live network behaviour across multiple steps. To train actors to cooperate with a generator to complete partial scans, we developed cooperative recurrent deterministic policy gradients (CRDPG, algorithm 1). This is an extension of RDPG to an actor that cooperates with another ANN to minimize its loss.

We train our networks by ADAM 72 optimized gradient descent for M = 10⁶ iterations with a batch size, N = 32. We use constant learning rates η_µ = 0.0005 and η_Q = 0.0010 for the actor and critic, respectively. For the generator, we use an initial learning rate η_G = 0.0030 with an exponential decay factor of 0.75^(5m/M) at iteration m. The exponential decay envelope is multiplied by a sawtooth cyclic learning rate 73 with a period of 2M/9 that oscillates between 0.2 and 1.0. Training takes two days with an Intel i7-6700 CPU and an Nvidia GTX 1080 Ti GPU. We augment training data by a factor of eight by applying a random combination of flips and 90° rotations, mapping s → s′ and I_N → I_N′, similar to our earlier work 16, 23, 64, 74 . Our generator is trained to minimize mean squared errors (MSEs) between scan completions, G(s′), and normalized target images, I_N′. Generator losses decrease during training as the generator learns, and may vary due to loss spikes 64 .

Algorithm 1 (CRDPG) summarizes training. Initialize the actor, µ, critic, Q, and generator, G, networks with parameters ω, θ and φ, respectively; initialize target networks, µ′ and Q′, with parameters ω′ ← ω and θ′ ← θ, respectively; and initialize the replay buffer, R, and the average generator loss, L_avg. For each training iteration, m = 1, ..., M: initialize an empty history, h_0; collect a history of actions and observations and add it to the replay; sample a batch of histories from the replay and construct partial scans; compute step losses, where the Kronecker delta, δ_tT, is 1 if t = T and 0 otherwise, and clip(L_G^i) is the smaller of L_G^i and three standard deviations above its running mean; compute target values, (y_1^i, ..., y_T^i), with the target networks, where H_Q^i and H_µ^i are the states of the live networks after computing Q(h_t^i, a_t^i) and µ(h_t^i), respectively; compute the critic and actor updates using BPTT and the generator update; update the actor, critic and generator by gradient descent; and update the target networks and the average generator loss.
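To illustrate the generator learning-rate schedule described above, the sketch below combines the exponential decay envelope with a sawtooth cycle. The exact sawtooth shape and phase are not specified in the text, so the rising ramp used here is an assumption.

```python
def generator_learning_rate(m, M=10**6, eta0=0.0030):
    """Exponentially decayed sawtooth-cyclic learning rate at iteration m:
    an envelope of eta0 * 0.75**(5m/M) multiplied by a sawtooth with period
    2M/9 that oscillates between 0.2 and 1.0."""
    envelope = eta0 * 0.75 ** (5 * m / M)
    period = 2 * M / 9
    phase = (m % period) / period          # rises from 0 to 1 each cycle (assumed)
    sawtooth = 0.2 + 0.8 * phase
    return envelope * sawtooth
```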
Normalizing loss sizes can improve RL 75 , so we divide generator losses used for critic training by their running mean, L_avg ← β_L L_avg + (1 − β_L)L_G, where we chose β_L = 0.997 and L_avg is updated at each training iteration. Heuristically, an optimal policy does not go over image edges as there is no information there in our training environment. To accelerate convergence, we therefore added a small loss penalty, E_t = 0.1, at step t if an action results in a probing position being over an image edge. The total loss at each step is L_t = E_t + δ_tT clip(L_G)/L_avg, (14) where clip(L_G) clips losses used for RL to three standard deviations above their running mean. This adaptive loss clipping is inspired by adaptive learning rate clipping 64 (ALRC) and reduces learning destabilization by high loss spikes. However, we expect that clipping normalized losses to a fixed threshold 71 would achieve similar results. The Kronecker delta, δ_tT, in equation 14 is 1 if t = T and 0 otherwise, so it only adds the generator loss at the final step, T. To estimate discounted future losses, Q_t^rl, for RL, we use a target actor and critic, Q_t^rl = L_t + γQ′(h_{t+1}, µ′(h_{t+1})), where we chose γ = 0.97. Target networks stabilize learning and decrease policy oscillations [76] [77] [78] . The critic is trained to minimize mean squared differences, L_Q, between predicted and target losses, and the actor is trained to minimize losses, L_µ, predicted by the critic. Our target actor and critic have trainable parameters ω′ and θ′, respectively, that track live parameters, ω and θ, by soft updates 49 , ω′ ← β_ω ω′ + (1 − β_ω)ω and θ′ ← β_θ θ′ + (1 − β_θ)θ, where we chose β_ω = β_θ = 0.9997. We also investigated hard updates 71 , where target networks are periodically copied from live networks; however, we found that soft updates result in faster convergence and more stable training.

In this section, we present examples of adaptive partial scans and select learning curves for architecture and learning policy experiments. Examples of 1/23.04 px coverage partial scans, target outputs and generator completions are shown in figure 2 for 96×96 crops from test set STEM images. They show both adaptive and spiral scans after flips and rotations to augment data for the generator. The first actions select a path segment from the middle of the image in the direction of a corner. Actors then use the first and following observations to inform where to sample the remaining T − 1 = 19 path segments. Actors adapt scan paths to specimens. For example, if an image contains regular atoms, an actor might cover a large area to see if there is a region where that changes. Alternatively, if an image contains a uniform region, actors may explore near image edges and far away from the uniform region to find region boundaries. The main limitation of our experiments is that generators trained to complete a variety of partial scan paths generated by an actor achieve lower performance than a generator trained to complete partial scans with a fixed path. For example, figure 3(a) shows that generators trained to cooperate with LSTM or GRU actors are outperformed by generators trained with fixed spiral or other scan paths shown in figure 3(b). Spiral paths outperform fixed scan paths; however, we emphasize that paths generated by actors are designed for individual training data, rather than all training data. Freezing actor training to prevent changes in actor policy does not result in clear improvements in generator performance.
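Before turning to results, the following is a minimal sketch of the running-mean normalization, adaptive clipping and step-loss construction of equation 14, using the quoted values β_L = 0.997 and E_t = 0.1. The running second moment used to estimate the standard deviation for the clip is an assumption.

```python
class LossNormalizer:
    """Running-mean normalization and clipping of generator losses used for RL."""
    def __init__(self, beta=0.997):
        self.beta, self.mean, self.sq_mean = beta, 1.0, 1.0

    def update(self, loss):
        self.mean = self.beta * self.mean + (1 - self.beta) * loss
        self.sq_mean = self.beta * self.sq_mean + (1 - self.beta) * loss ** 2

    def clip(self, loss):
        std = max(self.sq_mean - self.mean ** 2, 0.0) ** 0.5
        return min(loss, self.mean + 3 * std)

    def step_loss(self, t, T, generator_loss, over_edge, penalty=0.1):
        """Total RL loss at step t: an over-edge penalty plus the clipped,
        normalized generator loss, which is added only at the final step."""
        loss = penalty if over_edge else 0.0
        if t == T:
            loss += self.clip(generator_loss) / self.mean
        return loss
```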
Consequently, we think that improvements to generator architecture or learning policy should be a starting point for further investigation. To find the best practical actor policy, we think that a generator trained for a variety of scan paths should achieve comparable performance to generators trained for single scan paths.

Figure 2. Test set 1/23.04 px coverage partial scans, target outputs and generated partial scan completions for 96×96 crops from STEM images. The top four rows show adaptive scans, and the bottom row shows spiral scans. Input partial scans are noisy, whereas target outputs are blurred.

We investigated a variety of popular RNN architectures to minimize inference time. Learning curves in figure 3(a) show that performance is similar for LSTMs and GRUs. GRUs require less computation. However, LSTM and GRU inference time is comparable and GRU training seems to be more prone to loss spikes, so LSTMs may be preferable. We also created a DNC by augmenting a deep LSTM with dynamic external memory. However, figure 3(c) shows that LSTM and DNC performance is similar, and inference time and computational requirements are much higher for our DNC. We tried to reduce computation and accelerate convergence by applying projection layers to LSTM hidden states 79 . However, we found that performance decreased with decreasing projection layer size. Experience replay buffers for RL often have heuristic sizes, such as 10⁶ examples. However, RL can be sensitive to replay buffer size 70 . Indeed, learning curves in figure 3(d) show that increasing buffer size improves learning stability and decreases test set errors. Increasing buffer size usually improves learning stability and decreases forgetting by exposing actors and critics to a higher variety of past policies. However, we expect that convergence would be slowed if the buffer became too large as increasing buffer size increases the expected time before experiences with new policies are replayed.

Figure 3. Learning curves for a-b) adaptive scan paths chosen by an LSTM or GRU, and fixed spiral and other fixed paths, c) adaptive paths chosen by an LSTM or DNC, d) a range of replay buffer sizes, e) a range of penalties for trying to sample at probing positions over image edges, and f) with and without normalizing or clipping generator losses used for critic training. All learning curves are 2500 iteration boxcar averaged and results in different plots are not directly comparable due to varying experiment settings. Means and standard deviations of test set errors, "Test: Mean, Std Dev", are at the ends of labels in graph legends.

We also found that increasing buffer size decreased the size of small loss oscillations [76] [77] [78] , which have a period near 2000 iterations. However, the size of loss oscillations does not appear to affect performance. We found that initial convergence is usually delayed if a large portion of initial actions go outside the imaging region. This would often delay convergence by about 10⁴ iterations before OU noise led to the discovery of better exploration strategies away from image edges. Although 10⁴ iterations is only 1% of our 10⁶ iteration learning policy, it often impaired development by delaying debugging or evaluation of changes to architecture and learning policy. Augmenting RL losses with subgoal-based heuristic rewards can accelerate convergence by making problems more tractable 80 .
Thus, we added loss penalties if actors tried to go over image edges, which accelerated initial convergence. Learning curves in figure 3 (e) show that over edge penalties at each step smaller than E t = 0.2 have a similar effect on performance. Further, performance is lower for higher over edge penalties, E t ≥ 0.2. We also found that training is more stable if over edge penalties are added at individual steps, rather than propagated to past steps as part of a discounted future loss. Our actor, critic and generator are trained together. It follows that generator losses, which our critic learns to predict, decrease throughout training as generator performance improves. However, normalizing loss sizes usually improves RL 75 , so we divide by their running means in equation 14. Learning curves in figure 3 (f) show that loss normalization improves learning stability and decreases final errors. Clipping training losses can improve RL 71 , so we clipped generator losses used for critic training to 3 standard deviations above their running means. We found that clipping increases test set errors, possibly because most training errors are in a similar regime. Thus, we expect that clipping may be more helpful for training with sparser scans as higher uncertainty may increase likelihood of unusually high generator losses. The main limitation of our adaptive scan system is that generator errors are much higher when a generator is trained for a variety of scan paths than when it is trained for a single scan path. However, we expect that generator performance for a variety of scans could be improved to match performance for single scans by developing a larger neural network with a better learning policy. To train actors to cooperate with generators, we developed CRDPG. This is an extension of RDPG 50 , and RDPG is based on DDPG 49 . Alternatives to DDPG, such as TD3 62 and D4PG 63 , arguably achieve higher performance, so we expect that they could form the basis of a future training algorithm. Further, we expect that architecture and learning policy could be improved by AdaNet 81 , Ludwig 82 , or other automatic machine learning [83] [84] [85] [86] [87] (AutoML) algorithms as AutoML can often match or surpass the performance of human developers 88, 89 . Finally, test set losses for a variety of scans appear to be decreasing at the end of training, so we expect that performance could be improved by increasing training iterations. After generator performance is improved, we expect the main limitation of our adaptive scan system to be distortions caused by probing position errors. Errors usually depend on scan path shape 34 and accumulate for each path segment. Non-linear scan distortions can be corrected by comparing pairs of orthogonal raster scans 90, 91 , and we expect this method can be extended to partial scans. However, orthogonal scanning would complicate measurement by limiting scan paths to two half scans to avoid doubling electron dose on beam-sensitive materials. Instead, we propose that a cyclic generator 92 could be trained to correct scan distortions and provide a detailed method as supplementary information 93 . Another limitation is that our generators do not learn to correct STEM noise 94 . However, we expect that generators can learn to remove noise, for example, from single noisy examples 95 or by supervised learning 74 . To simplify our preliminary investigation, our scan system samples straight path segments and cannot go outside a specified imaging region. 
However, actors could learn to output actions with additional degrees of freedom to describe curves, multiple successive path segments, or sequences of non-contiguous probing positions. Similarly, additional restrictions could be applied to actions. For example, actions could be restricted to avoid actions that cause high probing position errors. Training environments could also be modified to allow actors to sample pixels over image edges by loading images larger than partial scan regions. In practice, actors can sample outside a scan region, and being able to access extra information outside an imaging region could improve performance. However, using larger images may slow development by increasing data loading and processing times. Not all scan systems support non-raster scan paths. However, many scan controllers can be augmented with an FPGA to enable custom scan paths 34, 35 . Recent versions of Gatan DigitalMicrograph support Python 96 , so our ANNs can be readily integrated into existing scan systems. Alternatively, an actor could be synthesized on a scan-controlling FPGA 97, 98 to minimize inference time. There could be hundreds of path segments in a partial scan, so computationally lightweight and parallelizable actors are essential to minimize scan time. We have therefore developed actors based on computationally inexpensive RNNs, which can remember state information to inform future decisions. Another approach is to update a partial scan at each step to be input to a feedforward neural network (FNN), such as a CNN, to decide actions. However, we expect that FNNs are less practical than RNNs as FNNs may require additional computation to reprocess all past states at each step.

Our initial investigation demonstrates that actor RNNs can be trained by RL to direct piecewise adaptation of contiguous scans to specimens for compressed sensing. We introduce CRDPG to train an RNN to cooperate with a CNN to complete STEM images from partial scans and present our learning policy, experiments, and example applications. After further development, we expect that adaptive scans will become the most effective approach to decrease electron beam damage and scan time with minimal information loss. Static sampling strategies are a subset of possible dynamic sampling strategies, so the performance of static sampling can always be matched or outperformed by dynamic sampling. Further, we expect that adaptive scan systems can be developed for most areas of science and technology, including for the reduction of medical radiation. To encourage further investigation, our source code, pretrained models, and training data are openly accessible. Supplementary information is openly accessible at https://doi.org/10.5281/zenodo.4384708. Therein, we present detailed ANN architecture, additional experiments and example scans, and a new method to correct partial scan distortions. The data that support the findings of this study are openly available.

Detailed actor, critic and generator architecture is shown in figure S1. Actors and critics have almost identical architecture, except actor fully connected layers output action vectors whereas critic fully connected layers output predicted losses. In most of our experiments, actors and critics are deep LSTMs 1 . However, we also augment deep LSTMs with dynamic external memory to create DNCs 2 in some of our experiments. Configuration details of actor and critic components shown in figure S1(a) follow. A two-layer deep LSTM with 256 hidden units in each layer.
To reduce signal attenuation, we add skip connections from inputs to the second LSTM layer and from the first LSTM layer to outputs. Weights are initialized from truncated normal distributions and biases are zero initialized. In addition, we add a bias of 1 to the forget gate to reduce forgetting at the start of training 3 . Initial LSTM cell and hidden states are initialized with trainable variables 4 . Our DNC implementation is adapted from Google Deepmind's 2, 5 . We use 4 read heads and 1 write head to control access to dynamic external memory, which has 16 slots with a word size of 64. Fully Connected: a dense layer linearly connects inputs to outputs. Weights are initialized from a truncated normal distribution and there are no biases. Conv d, w×w, Stride, x: convolutional layer with a square kernel of width, w, that outputs d feature channels. If the stride is specified, convolutions are only applied to every xth spatial element of their input, rather than to every element. Striding is not applied depthwise. Trans Conv d, w×w, Stride, x: transpositional convolutional layer with a square kernel of width, w, that outputs d feature channels. If the stride is specified, convolutions are only applied to every xth spatial element of their input, rather than to every element. Striding is not applied depthwise. +: circled plus signs indicate residual connections where incoming tensors are added together. Residuals help reduce signal attenuation and allow a network to learn perturbative transformations more easily.

The actor and critic cooperate with a convolutional generator, shown in figure S1(b), to complete partial scans. Our generator is constructed from convolutional layers 6 and skip-3 residual blocks 7 . Each convolutional layer is followed by ReLU 8 activation then batch normalization 9 , and residual connections are added between activation and batch normalization. The convolutional weights are Xavier 10 initialized and biases are zero initialized. We apply L2 regularization 11 to decay generator parameters by a factor, β = 0.99999, at each training iteration. This decay rate is heuristic and the L2 regularization is primarily a precaution against overfitting. Further, adding L2 regularization did not have a noticeable effect on performance. We also investigated gradient clipping [12] [13] [14] [15] to a range of static and dynamic thresholds for actor and critic training. However, we found that gradient clipping decreases convergence if clipping thresholds are too small and otherwise does not have a noticeable effect.

This section presents additional learning curves for architecture and learning policy experiments in figure S2. For example, learning curves in figure S2(a) show that generator training with an exponentially decayed cyclic learning rate 16 results in faster convergence and lower final errors than just using an exponentially decayed learning rate. We were concerned that a cyclic learning rate might cause generator loss oscillations if the learning rate oscillated too high. Indeed, our investigation of loss normalization was, in part, to prevent potential generator loss oscillations from destabilizing critic training. However, our learning policy results in generator losses that steadily decay throughout training. To train actors by BPTT, we differentiate losses predicted by critics w.r.t. actor parameters by the chain rule, ∂Q(h_t^i, a_t^i)/∂ω = [∂Q(h_t^i, a_t^i)/∂µ(h_t^i)][∂µ(h_t^i)/∂ω]. An alternative approach is to replace ∂Q(h_t^i, a_t^i)/∂µ(h_t^i) with a derivative w.r.t.
replayed actions, ∂Q(h_t^i, a_t^i)/∂a_t^i. This is equivalent to adding noise, stop_gradient(a_t^i − µ(h_t^i)), to an actor action, µ(h_t^i), where stop_gradient(x) is a function that stops gradient backpropagation to x. However, learning curves in figure S2(b) show that differentiation w.r.t. live actor actions results in faster convergence to lower losses. Results for ∂Q(h_t^i, a_t^i)/∂a_t^i are similar if OU exploration noise is doubled. Most STEM signals are imaged at several times their Nyquist rates 17 . To investigate adaptive STEM performance on signals imaged close to their Nyquist rates, we downsampled STEM images to 96×96. Learning curves in figure S2(c) show that losses are lower for oversampled STEM crops. Following, we investigated if MSEs vary for training with different loss metrics by adding a Sobel loss, λ_S L_S, to generator losses. Our Sobel loss, L_S, compares Sobel derivatives of generated and target images, where S(x) computes a channelwise concatenation of horizontal and vertical Sobel derivatives 18 of x, and we chose λ_S = 0.1 to weight the contribution of L_S to the total generator loss, L_G + λ_S L_S. Learning curves in figure S2(c) show that Sobel losses do not decrease training MSEs for STEM crops. However, Sobel losses decrease MSEs for downsampled STEM images. This motivates the exploration of alternative loss functions 19 to further improve performance. For example, our earlier work shows that generator training as part of a generative adversarial network [20] [21] [22] [23] (GAN) can improve STEM image realism 24 . Similarly, we expect that generated image realism could be improved by training generators with perceptual losses 25 . After we found that adding a Sobel loss can decrease MSEs, we also experimented with other loss functions, such as the maximum MSE of 5×5 regions. Learning curves in figure S2(d) show that MSEs result in faster convergence than maximum region losses; however, both loss functions result in similar final MSEs.

Figure S2. Learning curves for a) exponentially decayed and exponentially decayed cyclic learning rate schedules, b) actor training with differentiation w.r.t. live or replayed actions, c) images downsampled or cropped from full images to 96×96 with and without additional Sobel losses, d) mean squared error and maximum regional mean squared error loss functions, e) supervision throughout training, supervision only at the start, and no supervision, and f) projection from 128 to 64 hidden units or no projection. All learning curves are 2500 iteration boxcar averaged, and results in different plots are not directly comparable due to varying experiment settings. Means and standard deviations of test set errors, "Test: Mean, Std Dev", are at the ends of graph labels.

Figure S3. Learning rate optimization. a) Learning rates are increased from 10^−6.5 to 10^0.5 for ADAM and SGD optimization. At the start, convergence is fast for both optimizers. Learning with SGD becomes unstable at learning rates around 2.2×10^−5 , and numerically unstable near 5.8×10^−4 , whereas ADAM becomes unstable around 2.5×10^−2 . b) Training with ADAM optimization for learning rates listed in the legend. Learning is visibly unstable at learning rates of 2.5×10^−2.5 and 2.5×10^−2 , and the lowest inset validation loss is for a learning rate of 2.5×10^−3.5 . Learning curves in (b) are 1000 iteration boxcar averaged. Means and standard deviations of test set errors, "Test: Mean, Std Dev", are at the ends of graph labels.
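As an illustration of the Sobel loss described above, the sketch below uses TensorFlow's built-in Sobel filter. The exact reduction used in the paper is not given, so a plain mean squared difference is assumed, and the weighting λ_S = 0.1 is applied as quoted.

```python
import tensorflow as tf

def sobel_loss(generated, target):
    """Mean squared difference between horizontal and vertical Sobel
    derivatives of generated and target images. Inputs are
    [batch, height, width, channels] tensors."""
    return tf.reduce_mean(tf.square(tf.image.sobel_edges(generated)
                                    - tf.image.sobel_edges(target)))

def total_generator_loss(generated, target, lambda_s=0.1):
    """MSE plus the weighted Sobel loss, L_G + lambda_S * L_S."""
    mse = tf.reduce_mean(tf.square(generated - target))
    return mse + lambda_s * sobel_loss(generated, target)
```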
We expect that MSEs calculated with every output pixel result in faster convergence than maximum region errors as more pixels inform gradient calculations. In any case, we expect that a better approach to minimize maximum errors is to use a higher order loss function, such as mean quartic errors. If training with a higher-order loss function is unstable, it might be stabilized by adaptive learning rate clipping 26 . Target losses can be directly computed with Bellman's equation, rather than with target networks. We refer to such directly computed target losses as "supervised" losses, y_t = Σ_{t′=t}^{T} γ^{t′−t} L_{t′}, where γ ∈ [0, 1) discounts future step losses, L_t. Learning curves for full supervision, supervision linearly decayed to zero in the first 10⁵ iterations, and no supervision are shown in figure S2(e). Overall, final errors are similar for training with and without supervision. However, we find that learning is usually more stable without supervised losses. As a result, we do not recommend using supervised losses. To accelerate convergence and decrease computation, an LSTM with n_h hidden units can be augmented by a linear projection layer with n_p < 3n_h/4 units 27 . Learning curves in figure S2(f) are for n_h = 128 and compare training with a projection to n_p = 64 units and no projection. Adding a projection layer increases the initial rate of convergence; however, it also increases final losses. Further, we found that training becomes increasingly prone to instability as n_p is decreased. As a result, we do not use projection layers in our actor or critic networks.

Generator learning rate optimization is shown in figure S3. To find the best initial learning rate for ADAM optimization, we increased the learning rate until training became unstable, as shown in figure S3(a). We performed the learning rate sweep over 10⁴ iterations to avoid results being complicated by losses rapidly decreasing in the first couple of thousand iterations. The best learning rate was then selected by training for 10⁵ iterations with learning rates within a factor of 10 of a learning rate 10× lower than where training became unstable, as shown in figure S3(b). We performed initial learning rate sweeps in figure S3(a) for both ADAM and stochastic gradient descent 28 (SGD) optimization. We chose ADAM as it is less sensitive to hyperparameter choices than SGD and because ADAM is recommended in the RDPG paper 29 . Test set errors are computed for 3954 test set images. Most test set errors are similar to or slightly higher than training set errors. However, training with fixed paths, which is shown in figure 3(a) of the main article, results in high divergence of test and training set errors. We attribute this divergence to the generator overfitting to complete large regions that are not covered by fixed scan paths. In comparison, our learning policy was optimized for training with a variety of adaptive scan paths where overfitting is minimal. After all 10⁶ training iterations, means and standard deviations (mean, std dev) of test set errors for fixed paths 2, 3 and 4 are (0.170, 0.182), (0.135, 0.133), and (0.171, 0.184). Instead, we report lower test set errors of (0.106, 0.090), (0.073, 0.045), and (0.106, 0.090), respectively, at 5 × 10⁵ training iterations, which correspond to early stopping 30, 31 . All other test set errors were computed after final training iterations. A limitation of partial STEM is that images are usually distorted by probing position errors, which vary with scan path shape 32 .
Distortions in raster scans can be corrected by comparing series of images 33, 34 . However, distortion correction of adaptive scans is complicated by more complex scan path shapes and microscope-specific actor command execution characteristics. We expect that command execution characteristics are almost static. It follows that there is a bijective mapping between probing locations in distorted adaptive partial scans and raster scans. Subsequently, we propose that distortions could be corrected by a cyclic generative adversarial network 35 (GAN). To be clear, this section outlines a possible starting point for future research that can be refined or improved upon. The method's main limitation is that the cyclic GAN would need to be trained or fine-tuned for individual scan systems. Let I_partial and I_raster be unpaired partial scans and raster scans, respectively. A binary mask, M, can be constructed to be 1 at nominal probing positions in I_partial and 0 elsewhere. We introduce generators G_p→r(I_partial) and G_r→p(I_raster, M) to map from partial scans to raster scans and from raster scans to partial scans, respectively. A mask must be input to the partial scan generator for it to output a partial scan with a realistic distortion field as distortions depend on scan path shape 32 . Finally, we introduce discriminators, D_partial and D_raster, that are trained to distinguish between real and generated partial scans and raster scans, respectively, and predict losses that can be used to train generators to create realistic images. In short, partial scans could be mapped to raster scans by minimizing total losses, L_p→r and L_r→p, that optimize G_p→r and G_r→p, respectively. A scalar, b, balances adversarial and cycle-consistency losses.

Additional sheets of test set adaptive scans are shown in figure S4 and figure S5. In addition, a sheet of test set spiral scans is shown in figure S6. Target outputs were low-pass filtered by a 5×5 symmetric Gaussian kernel with a 2.5 px standard deviation to suppress high-frequency noise.

Figure S4. Test set 1/23.04 px coverage adaptive partial scans, target outputs, and generated partial scan completions for 96×96 crops from STEM images.

Figure S5. Test set 1/23.04 px coverage adaptive partial scans, target outputs, and generated partial scan completions for 96×96 crops from STEM images.

Figure S6. Test set 1/23.04 px coverage spiral partial scans, target outputs, and generated partial scan completions for 96×96 crops from STEM images.

This chapter covers my paper titled "Adaptive Partial Scanning Transmission Electron Microscopy with Reinforcement Learning" 5 and associated research outputs 15, 22 . It presents an initial investigation into STEM compressed sensing with contiguous scans that are piecewise adapted to specimens. Adaptive scanning is a finite-horizon partially observed Markov decision process 212 . There are a variety of additional refinements that could improve training. As an example, RNN computation is delayed by calling a Python function to observe each path segment. Delay could be reduced by more efficient sampling, e.g. by using a parallelized routine coded in C/C++; by selecting several possible path segments in advance and selecting the segment that most closely corresponds to an action; or by choosing actions at least one step in advance rather than at each step.
In addition, it may help if the generator undergoes additional training iterations in parallel to actor and critic training as improving the generator is critical to improving performance. Finally, increasing generator training iterations may result in overfitting, so it may help to train generators as part of a GAN or introduce other regularization mechanisms. For context, I find that adversarial training can reduce validation divergence 7 (ch. 7) and produce more realistic partial scan completions 4 (ch. 4).

Many imaging modes in electron microscopy are limited by noise [1] . Increasingly, ever more sophisticated and expensive hardware and software based methods are being developed to increase resolution, including aberration correctors [2, 3] , advanced cold field emission guns [4, 5] , holography [6, 7] and others [8] [9] [10] . However, techniques that produce low signals [9] , or are low-dose to reduce beam damage [11] , are fundamentally limited by the signal-to-noise ratios in the micrographs they produce. Many general [12] and electron microscopy-specific [1, 13] denoising algorithms have been developed. However, most of these algorithms rely on hand-crafted filters and are rarely, if ever, fully optimized for their target domains [14] . Neural networks are universal approximators [15] that overcome these difficulties [16] through representation learning [17] . As a result, networks are increasingly being applied to noise removal [18] [19] [20] [21] and other applications in electron microscopy [22] [23] [24] [25] . Image processing by convolutional neural networks (CNNs) takes the form of a series of convolutions that are applied to the input image. While a single convolution may appear to be an almost trivial image processing tool, successive convolutions [26] can transform the data into different mappings. For example, a discrete Fourier transformation can be represented by a single-layer neural network with a linear transfer function [27] . The weightings in each convolution are effectively the parameters that link the neurons in each successive layer of the CNN and allow any conceivable image processing to be undertaken by a general CNN architecture, if trained appropriately. Training in this context means the use of some optimisation routine to adjust the weights of the many convolutions (often several thousand parameters) to minimise a loss function that compares the output image with a desired one. A generally applicable CNN requires training on tens of thousands of model images, which is a non-trivial task. The recent success of large neural networks in computer vision may be attributed to the advent of graphical processing unit (GPU) acceleration [28, 29] , particularly GPU acceleration of large CNNs [30, 31] in distributed settings [32, 33] , allowing this time-consuming training to be completed on acceptable timescales. Application of these techniques to electron microscopy may allow significant improvements in performance, particularly in areas that are limited by signal-to-noise. At the time of writing, there are no large CNNs for electron micrograph denoising. Instead, most denoising networks act on small overlapping crops, e.g. [20] . This makes them computationally inefficient and unable to utilize all the information available. Some large denoising networks have been trained as part of generative adversarial networks [34] and try to generate images resembling high-quality training data as closely as possible.
This can avoid the blurring effect of most filters by generating features that might be in high-quality micrographs. However, this means that they are prone to producing undesirable artefacts. This paper presents the deep CNN in Fig. 1 for electron micrograph denoising. Our network architecture and training hyperparameters are similar to DeepLab3 [35] and DeepLab3+ [36] , with the modifications discussed in [37] . Briefly, image processing starts in a modified Xception [38] encoder, which spatially downsamples its 512 × 512 input to a 32 × 32 × 728 tensor. These high-level features flow into an atrous spatial pyramid pooling (ASPP) module [35, 36] that combines the outputs of atrous convolutions acting on different spatial scales into a 32 × 32 × 256 tensor. A multi-stage decoder then upsamples the rich ASPP semantics to a 512 × 512 output by combining them with low-level encoder features. This recombination with low-level features helps to reduce signal attenuation. For computational and parameter efficiency, most convolutions are depthwise separated into pointwise and depthwise convolutions [38] , rather than standard convolutions.

An ideal training dataset might have a wide variety of images and zero noise, enabling the CNN to be trained by inputting artificially degraded images and comparing its output with the zero-noise image. Such datasets can only be produced by simulation (which may be a time-consuming task), or approximated by experimental data. Here, we used 17,267 electron micrographs saved to University of Warwick data servers by scores of scientists working on hundreds of projects over several years. The data set therefore has a diverse constitution, including for example phase contrast images of polymers, diffraction contrast images of semiconductors, high resolution lattice imaging of crystals and a small number of CBED patterns. It comprises 32-bit images collected on Gatan SC600 or SC1000 Orius cameras on JEOL 2000FX, 2100, 2100plus and ARM200F microscopes. Scanning TEM (STEM) images were not included. There are several contributions to noise from these charge-coupled device (CCD) cameras, which form an image of an optically coupled scintillator, including [39] : Poisson noise, dictated by the size of the detected signal; electrical readout and shot noise; systematic errors in dark reference, linearity, gain reference, dead pixels or dead columns of pixels (some, but not all, of which is typically corrected by averaging in the camera software); and X-ray noise, which results in individual pixels having extremely high or low values. In order to minimize the effects of Poisson noise in this dataset we only included micrographs with mean counts per pixel above 2500. X-ray noise, typically affecting only 0.05-0.10% of pixels, was left uncorrected. Each micrograph was cropped to 2048 × 2048 and binned by a factor of two to 1024 × 1024. This increased the mean count per pixel to above 10,000, i.e. a signal-to-(Poisson)noise ratio above 100:1. The effects of systematic errors were mitigated by taking 512 × 512 crops at random positions followed by a random combination of flips and 90° rotations (in the process, augmenting the dataset by a factor of eight). Finally, each image was then scaled to have single-precision (32-bit) pixel values between zero and one. Our dataset was split into 11,350 training, 2431 validation and 3486 test micrographs.
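The following is a minimal NumPy sketch of the crop-and-augment preprocessing described above: random 512 × 512 crops followed by a random combination of flips and 90° rotations, then scaling to [0, 1]. It is illustrative rather than the released TensorFlow input pipeline, and the min-max rescaling is an assumption about how pixel values were mapped to the unit interval.

```python
import numpy as np

def augment(image, rng, crop=512):
    """Random crop, random flips and 90-degree rotations, and [0, 1] scaling."""
    y = rng.integers(0, image.shape[0] - crop + 1)
    x = rng.integers(0, image.shape[1] - crop + 1)
    patch = image[y:y + crop, x:x + crop]
    if rng.random() < 0.5:
        patch = np.flipud(patch)
    if rng.random() < 0.5:
        patch = np.fliplr(patch)
    patch = np.rot90(patch, k=rng.integers(0, 4))
    # Linearly rescale to [0, 1] (assumed min-max scaling).
    patch = (patch - patch.min()) / (patch.max() - patch.min() + 1e-8)
    return patch.astype(np.float32)
```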
This was pipelined using the TensorFlow [33] deep learning framework to a replica network on each of a pair of Nvidia GTX 1080 Ti GPUs for training via ADAM [40] optimized synchronous stochastic gradient descent [32] . To train the network for low doses, Poisson noise was applied to each 512 × 512 training image after multiplying it by a scale factor, effectively setting the dose in electrons per pixel for a camera with perfect detective quantum efficiency (DQE). These doses were generated by adding 25.0 to numbers, x, sampled from an exponential distribution with probability density function p(x) = (1/λ)exp(−x/λ), where we chose λ = 75.0. To place this in context, the minimum dose in this training data is equivalent to only 25 e⁻² for a camera with perfect DQE and a pixel size of 5 µm at 50,000×. These numbers and distribution thus exposed the network to a continuous range of signal-to-noise ratios (most below 10:1) appropriate for typical low-dose electron microscopy [41] . After noise application, ground truth training images were scaled to have the same mean as their noisy counterparts. After being trained for low-dose applications, the network was fine-tuned for high doses by training it on crops scaled by numbers uniformly distributed between 200 and 2500. That is, by scale factors for signal-to-noise ratios between 10√2:1 and 50:1. The learning curve for our network is shown in Fig. 2 . It was trained to minimize the mean squared error (MSE) between its denoised output and the original image before the addition of noise. To surpass our low-dose performance benchmarks, our network had to achieve an MSE lower than 7.5 × 10⁻⁴, as tabulated in Table 1 . Consequently, MSEs were scaled by 1000 so that small MSEs did not limit trainable parameter perturbations.

Fig. 1. Simplified network showing how features produced by an Xception backbone are processed. Complex high-level features flow into an atrous spatial pyramid pooling module that produces rich semantic information. This is combined with simple low-level features in a multi-stage decoder to resolve denoised micrographs.

Table 1. Mean MSE and SSIM for several denoising methods applied to 20,000 instances of Poisson noise and their standard errors. All methods were implemented with default parameters. Gaussian: 3 × 3 kernel with a 0.8 px standard deviation. Bilateral: 9 × 9 kernel with radiometric and spatial scales of 75 (scales below 10 have little effect while scales above 150 cartoonize images). Median: 3 × 3 kernel. Wiener: no parameters. Wavelet: BayesShrink adaptive wavelet soft-thresholding with wavelet detail coefficient thresholds estimated using [56] . Chambolle and Bregman TV: iterative total-variation (TV) based denoising [57] [58] [59] , both with denoising weights of 0.1 and applied until the fractional change in their cost function fell below a set tolerance.

All neurons were ReLU6 [43] activated. Our experiments with other activations are discussed in [37] . Weights were Xavier uniform initialized [44] and biases were zero initialized. During training, L2 regularization [45] was applied by adding 5 × 10⁻⁵ times the quadrature sum of all trainable variables to the loss function. This prevented trainable parameters growing unbounded, decreasing their ability to learn in proportion [46] . Importantly, this ensures that our network continues to learn effectively if it is fine-tuned or given additional training. We did not perform an extensive search for our regularization rate and think that 5 × 10⁻⁵ may be too high.
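The noise model described above can be sketched as follows: a dose is sampled as 25 plus an exponential variate with scale 75, Poisson noise is applied at that dose, and the clean target is rescaled to the same mean as its noisy counterpart. Dividing the noisy counts by the dose to return to the clean image's scale is an assumption; this is an illustration, not the released training code.

```python
import numpy as np

def apply_poisson_noise(clean01, rng):
    """Degrade a clean [0, 1] image with Poisson noise at a random low dose."""
    dose = 25.0 + rng.exponential(scale=75.0)          # counts per pixel
    noisy = rng.poisson(clean01 * dose).astype(np.float32) / dose
    # Rescale the ground truth to the same mean as its noisy counterpart.
    target = clean01 * (noisy.mean() / (clean01.mean() + 1e-8))
    return noisy, target

# Example usage:
# rng = np.random.default_rng(0)
# noisy, target = apply_poisson_noise(clean_crop, rng)
```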
Our network is allowed to produce outputs outside the range of the input image, i.e. [0.0, 1.0]. However, outputs can be optionally clipped to this range during inference. Noisy images are expected to have more extreme values than restored images, so clipping the restored images to [0.0, 1.0] helps to safeguard against overly extreme outputs. Consequently, all performance statistics, including losses during training, are reported for clipped outputs. We trained batch normalization layers from [47] with a decay rate of 0.999 until the instabilities introduced by their trainable parameters began to limit convergence. Then, after 134,108 batches, batch normalization was frozen. During training, batch normalization layers map features, y, using their means, μ, and standard deviations, σ, and a small number, ε, to the normalized features (y − μ)/(σ + ε). Batch normalization has a number of advantages, including reducing covariate shift [47] and improving gradient stability [60] to decrease training time and improve accuracy. We found that batch normalization also seems to significantly reduce structured error variation in our output images (see Section 3). ADAM [40] optimization was used throughout training with a stepped learning rate. For the low-dose version of the network, we used a learning rate of 1.0 × 10⁻³ for 134,108 batches, 2.5 × 10⁻⁴ for another 17,713 batches and then 1.0 × 10⁻⁴ for 46,690 batches. The network was then fine-tuned for high doses using a learning rate of 2.5 × 10⁻⁴ for 16,773 batches, then 1.0 × 10⁻⁴ for 17,562 batches. These unusual intervals are a result of learning rates being adjusted at wall clock times. We found the recommended [33, 40] ADAM decay rate for the first moment of the momentum, β₁ = 0.9, to be too high and chose β₁ = 0.5 instead. This lower β₁ made training more responsive to varying noise levels in batches. We designed our network to be trained end-to-end, rather than in stages, so that it is easy to fine-tune or retrain for other applications. This is important as multi-stage training regimens introduce additional hyperparameters and complexity that may make the network difficult to use in practice. Nevertheless, we expect it to be possible to achieve slightly higher performance by training components of our neural network in stages and then fine-tuning the whole network end-to-end. Multi-stage training to eke out slightly higher performance may be appropriate if our network is to be applied to a specific, performance-critical application.

To benchmark our network's performance, we applied it and eight popular denoising methods to 20,000 instances of noise applied to 512 × 512 test micrographs. Table 1 shows the results for both low-dose and high-dose networks and data, giving the mean MSE and structural similarity index (SSIM) [62] for the denoised images compared with the original images before noise was added. The first row gives statistics for the unfiltered data, establishing a baseline. Our network outperforms all other methods using both metrics (N.B. SSIM is 1 for perceptually similar images; 0 for perceptually dissimilar). The improved performance can be seen in more detail in Fig. 3 , which shows performance probability density functions (PDFs) for both the low- and high-dose versions of our network. Notably, the fraction of images with an MSE above 0.002 is negligible for our low-dose neural network, while all other methods have a noticeable tail of difficult-to-correct images that retain higher MSEs.
All methods produce much smaller MSEs for the high-dose data; however, a similar trend is present. The network consistently produces better results and has fewer images that have high errors. Interestingly, the mean squared error PDFs for the network appear to have two main modes: there is a sharp peak at 0.0002 and a second at 0.0008 in the MSE PDF plots of Fig. 3 . Similarly, a bimodal distribution is present in the high-dose data. This may be due to different performance for different types of micrograph, perhaps reflecting the mixture of diffraction contrast and phase contrast images used in training and testing. If this is the case, it may be possible to improve performance significantly for specific applications by training on a narrower range of data. Mean absolute errors of our network's output for 20,000 examples are shown in Fig. 4 . Absolute errors are almost uniformly low. They are only significantly higher near the edges of the output, as shown by the inset image showing 16 × 16 corner pixels. The mean absolute errors per pixel are 0.0177 and 0.0102 for low and high doses, respectively. Small, grid-like variations in absolute error are revealed by contrast-limited adaptive histogram equalization [61] in Fig. 4 . These variations are common in deep learning and are often associated with transpositional convolutions. Consequently, some authors [63] have recommended their replacement with bilinear upsampling followed by convolution. We tried this; however, we found that while it made the errors less grid-like, it did not change the absolute errors significantly. Instead, we found batch normalization to be a simple and effective way to reduce structured error variation, likely due to the regularizing effect of its instability. This is evident from the more grid-like errors in the high-dose version of our network, which was trained for longer after batch normalization was frozen. More advanced methods that reduce structured error variation are discussed in [64] but were not applied here. Example applications of our low-dose network being used to remove applied noise from high-quality 512 × 512 electron micrographs are shown in Fig. 5 . In practice, our program may be applied to arbitrarily large images by dividing them into slightly overlapping 512 × 512 crops that can be processed. Our code does this by default. Slightly overlapping crops allow the higher errors at the edges of the neural network output to be avoided, decreasing errors below the values we report. To reduce errors at image edges, where crops cannot be overlapped, we use reflection padding. Users can customize the amount of overlap, padding and many other options or use default values. The most successful conventional noise-reduction method applied to our data is the iterative Chambolle total variation algorithm, c.f. Fig. 3 , which takes more than four times the runtime of our neural network on our hardware. As part of development, we experimented with shallower architectures similar to [18, 20, 21] ; however, these networks could not surpass Chambolle's low-dose benchmark (Table 1) . Consequently, we switched to the deeper Xception-based architecture presented here. Overall, our neural network demonstrates that deep learning is a promising avenue to improve low-dose electron microscopic imaging. While our network significantly outperforms Chambolle TV for our data, it still has the capacity to be improved through better learning protocols or further training for specific datasets.
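A minimal sketch of the overlapping-crop inference scheme described above follows: an arbitrarily large image is reflection padded, processed as slightly overlapping 512 × 512 tiles, and only tile centres are kept. It is an illustration of the idea, not the released code, and assumes the image is at least a few crops wide; `denoise_crop` stands in for a call to the trained network on a single crop.

```python
import numpy as np

def denoise_large_image(image, denoise_crop, crop=512, overlap=64):
    """Tile a large image into overlapping crops, denoise each, keep centres."""
    step = crop - overlap
    pad = overlap // 2
    ny = -(-image.shape[0] // step)  # ceil division
    nx = -(-image.shape[1] // step)
    padded = np.pad(image,
                    ((pad, ny * step + pad - image.shape[0]),
                     (pad, nx * step + pad - image.shape[1])),
                    mode="reflect")
    out = np.zeros((ny * step, nx * step), dtype=np.float32)
    for i in range(ny):
        for j in range(nx):
            tile = padded[i * step:i * step + crop, j * step:j * step + crop]
            restored = np.clip(denoise_crop(tile), 0.0, 1.0)  # optional clipping
            out[i * step:(i + 1) * step, j * step:(j + 1) * step] = \
                restored[pad:pad + step, pad:pad + step]
    return out[:image.shape[0], :image.shape[1]]
```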
It is most useful in applications limited by noise, particularly biological low-dose applications, and tuning its performance for the noise characteristics of a specific dose, microscope and camera may be worthwhile for optimal performance. Further improvement of the encoder-decoder architecture may also be possible, producing further gains in performance. One of the advantages of network algorithms is their speed in comparison with other techniques. We speed-tested our network by applying it to 20,000 512 × 512 images with one external GTX 1080 Ti GPU and one thread of an i7-6700 processor. Once loaded, it has a mean worst-case (i.e. batch size 1) inference time of 77.0 ms, which means that it can readily be applied to large amounts of data. This compares favorably with the best conventional method on our data, Chambolle's, which has an average runtime of 313.6 ms. We designed our network to have a high capacity so that it can discriminate between and learn from experiences in multiple domains. It has also been L2 regularized to keep its weights and biases low, ensuring that it will continue to learn effectively. This means that it is well-suited for further training to improve performance in other domains. Empirically, pre-training a model in a domain other than the target domain often improves performance. Consequently, we recommend the pretrained models we provide as a starting point to be fine-tuned for other domains. Our original write-up of this work, which is less targeted at electron microscopists, is available as [37] . Our original preprint has more example applications to TEM and STEM images, a more detailed discussion of the architecture and additional experiments we did to refine it. We have developed a deep neural network for electron micrograph denoising using a modified Xception backbone for encoding, an atrous spatial pyramid pooling module and a multi-stage decoder. We find that it outperforms existing methods for low and high electron doses. It is fast and easy to apply to arbitrarily large datasets. While our network generally performs well on most noisy images as-is, further optimization for specific applications is possible. We expect applications to be found in low-dose imaging, which is limited by noise. Our code and pre-trained low- and high-dose models are available at: https://github.com/Jeffrey-Ede/Electron-Micrograph-Denoiser.

There are amendments or corrections to the paper 6 covered by this chapter. Location: Page 19, text following eqn 1. Change: "...to only 25 e −2 for a camera..." should say "...to only 25 eÅ −2 for a camera...". Location: Page 21, first paragraph of performance section. Change: "...structural similarity index (SSIM)..." should say "...structural similarity index measure (SSIM)...". This chapter covers our paper titled "Improving Electron Micrograph Signal-to-Noise with an Atrous Convolutional Encoder-Decoder" 6 and associated research outputs 15 .

For a typical crystal system well-approximated by two Bloch waves [4] , ϕ is a distance between Bloch wavevectors, λ is the electron wavelength, ξ is an extinction distance for two Bloch waves, ⟨...⟩_r denotes an average with respect to r, and Im(z) is the imaginary part of z. Other applications of ψ_exit(r, z) [6] include information storage, point spread function deconvolution, improving contrast, aberration correction [7] , thickness measurement [8] , and electric and magnetic structure determination [9, 10] .
Exit wavefunctions can also simplify comparison with simulations as no information is lost. In general, the intensity, I(S), of a measurement with support, S, is an integral of |ψ|² over the support. A support is a measurement region, such as an electron microscope camera [11, 12] element. Half of wavefunction information is lost at measurement as |ψ|² is a function of amplitude, A > 0, and not phase, θ ∈ [−π, π), for ψ = A exp(iθ). We emphasize that we define A to be positive so that |ψ|² → A is bijective, and ψ sign information is in exp(iθ). Phase information loss is a limitation of conventional single image approaches to electron microscopy, including transmission electron microscopy [13] (TEM), scanning transmission electron microscopy [14] (STEM), and scanning electron microscopy [15] (SEM). In the Abbe theory of wave optics [16] in fig. 1b, the projection of ψ to a complex spectrum, ψ_dif(q), in reciprocal space, q, at the back focal plane of an objective lens can be described by a Fourier transform (FT), ψ_dif(q) = FT[ψ_exit(r)] = ∫ ψ_exit(r) exp(−2πiq · r) dr. (4) In practice, ψ_dif(q) is perturbed to ψ_pert by an objective aperture, E_ap, coherence, E_coh, chromatic aberration, E_chr, and lens aberrations, χ, and is described in the Fourier domain [1] by ψ_pert(q) = E_ap(q)E_coh(q)E_chr(q) exp(−iχ(q))ψ_dif(q), (5) where E_ap(q) = 1 for |q| ≤ kθ_max and 0 for |q| > kθ_max, (6) E_coh(q) = exp(−(∇χ(q))²(kθ_coh)²/(4 ln 2)), (7) and χ is a sum over aberration terms of the form C_{n,m,a} θ^{n+1} cos(mφ)/(n + 1) + C_{n,m,b} θ^{n+1} sin(mφ)/(n + 1), (9) for an objective aperture with angular extent, θ_max, illumination aperture with angular extent, θ_coh, energy spread, ∆E, chromatic aberration coefficient of the objective lens, C_c, relativistically corrected acceleration voltage, U*_a, aberration coefficients, C_{n,m,a} and C_{n,m,b}, angular inclination of perturbed wavefronts to the optical axis, φ, angular position in a plane perpendicular to the optical axis, θ, m, n ∈ N_0, and m + n odd. All waves emanating from points in Fourier space interfere in the image plane to produce an image wave, ψ_img(r), mathematically described by an inverse Fourier transform (FT⁻¹), ψ_img(r) = FT⁻¹(ψ_pert(q)) = ∫ ψ_pert(q) exp(2πiq · r) dq. (10)

Information transfer from ψ_exit to measured intensities can be modified by changing χ, typically by controlling the focus of the objective lens. However, half of ψ_exit information is missing from each measurement. To overcome this limitation, a wavefunction can be iteratively fitted to a series of aligned images with different χ [17, 18, 19, 20] . However, collecting an image series, waiting for sample drift to decay, and iterative fitting delays each ψ_exit measurement. As a result, aberration series reconstruction is unsuitable for live exit wavefunction reconstruction. Electron holography [1, 18, 21] is an alternative approach to exit wavefunction reconstruction that compares ψ_exit to a reference wave. Typically, a hologram, I_hol, is created by moving a material off-axis and introducing an electrostatic biprism after the objective aperture. The Fourier transform of a Möllenstedt biprismatic hologram is [1] FT(I_hol(r)) = FT(1 + |ψ_exit(r)|²) + µFT(ψ_exit(r)) ⊗ δ(q − q_c) + µFT(ψ*_exit(r)) ⊗ δ(q + q_c), (11) where ψ*_exit(r) is the complex conjugate of ψ_exit(r), |q_c| is the carrier frequency of interference fringes, and their contrast, µ, is given by source spatiotemporal coherence, µ_coh, inelastic interactions, µ_inel, instabilities, µ_inst, and the modulation transfer function [22] , MTF, of a detector.
Convolutions with Dirac δ in eqn. 11 describe sidebands in Fourier space that can be cropped, centered, and inverse Fourier transformed for live exit wavefunction reconstruction. However, off-axis holograms are susceptible to distortions and require meticulous microscope alignment as phase information is encoded in interference fringes [1], and cropping Fourier space reduces resolution [21]. Artificial neural networks (ANNs) have been trained to recover phases of optical holograms from single images [23]. In general, this is not possible as there are an infinite number of physically possible θ for a given A. However, ANNs are able to leverage an understanding of the physical world to recover θ if the distribution of possible holograms is restricted, for example, to biological cells. Non-iterative methods that do not use ANNs to recover phase information from single images have also been developed. However, they are limited to defocused images in the Fresnel regime [24], or to non-planar incident wavefunctions in the Fraunhofer regime [25]. One-shot phase recovery with ANNs overcomes the limitations of traditional methods: it is live, not susceptible to off-axis holographic distortions, does not require microscope modification, and can be applied to any imaging regime. In addition, ANNs could be applied to recover phases of images in large databases, long after samples may have been lost or destroyed. In this paper, we investigate the application of deep learning to one-shot exit wavefunction reconstruction in conventional transmission electron microscopy (CTEM).

To showcase one-shot exit wavefunction reconstruction, we generated 98340 exit wavefunctions with clTEM [27, 28] multislice propagation for 12789 CIFs [29] downloaded from the Crystallography Open Database [30, 31, 32, 33, 34, 35] (COD). Complex 64-bit 512×512 wavefunctions were simulated for CTEM with acceleration voltages in {80, 200, 300} kV, material depths along the optical axis uniformly distributed in [5, 100) nm, material widths perpendicular to the optical axis in [5, 10) nm, and crystallographic zone axes (h, k, l), where h, k, l ∈ {0, 1, 2}. Materials are padded on all sides with 0.8 nm of vacuum in the image plane, and 0.3 nm along the optical axis, to reduce simulation artefacts. Finally, crystal tilts about each axis were perturbed by zero-centered Gaussian random variates with standard deviation 0.1°. We used default values for other clTEM hyperparameters.

Multislice exit wavefunction simulations with clTEM are based on [36]. Simulations start with a planar wavefunction, ψ, travelling along a TEM column,

ψ(x, y, z) = exp(2πiz/λ), (13)

where x and y are in-plane coordinates, and z is distance travelled. After passing through a thin specimen with thickness ∆z, wavefunctions are approximated by

ψ(x, y, z + ∆z) ≈ exp(iσV_z(x, y)∆z) ψ(x, y, z), (14)

with

σ = 2πmeλ/h², (15)

where V_z is the projected potential of the specimen at z, m is relativistic electron mass, e is fundamental electron charge, and h is Planck's constant. For electrons propagating through a thicker specimen, cumulative phase change can be described by a specimen transmission function, t(x, y, z), so that

ψ(x, y, z + ∆z) = t(x, y, z) ψ(x, y, z), (16)

with

t(x, y, z) = exp(iσ ∫_z^{z+∆z} V(x, y, z′) dz′). (17)

A thin sample can be divided into multiple thin slices stacked together using a propagator function, P, to map wavefunctions between slices. A wavefunction at slice n is mapped to a wavefunction at slice n + 1 by

ψ_{n+1}(x, y) = P(x, y, ∆z) ⊗ [t_n(x, y) ψ_n(x, y)], (18)

where ψ_0 is the incident wave in eqn. 13.
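As a concrete illustration of the recursion in eqns 13–18, a minimal NumPy sketch is given below. It assumes a precomputed stack of projected slice potentials and a Fresnel free-space propagator applied in reciprocal space by FFT-based convolution; the propagator form, the band limit and all names are illustrative assumptions rather than clTEM's implementation.

```python
import numpy as np

def multislice(projected_potentials, dz, wavelength, sigma, pixel_size):
    """Propagate a plane wave through slices of projected potential (eqns 13-18).

    projected_potentials: (n_slices, N, N) array of slice-projected potentials, V_z.
    dz: slice thickness. wavelength: electron wavelength. sigma: interaction parameter.
    pixel_size: real-space sampling, used to build reciprocal-space coordinates.
    """
    n_slices, N, _ = projected_potentials.shape
    psi = np.ones((N, N), dtype=np.complex64)  # planar incident wave (eqn 13, unit amplitude)

    # Reciprocal-space grid for an assumed Fresnel free-space propagator between slices.
    k = np.fft.fftfreq(N, d=pixel_size)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    propagator = np.exp(-1j * np.pi * wavelength * k2 * dz)

    # Optional band limit to reduce aliasing from the periodic FFT-based convolution.
    band_limit = k2 <= (2.0 / 3.0 * k.max())**2

    for n in range(n_slices):
        t_n = np.exp(1j * sigma * projected_potentials[n] * dz)  # transmission function (eqn 14)
        psi = np.fft.ifft2(np.fft.fft2(t_n * psi) * propagator * band_limit)  # eqn 18

    return psi  # exit wavefunction after the final slice
```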
Simulations with clTEM are based on OpenCL [37], and use graphical processing units (GPUs) to accelerate fast Fourier transform [38] (FFT) based convolutions. The propagator is calculated in reciprocal space, where k_x, k_y are reciprocal space coordinates, and k = (k_x² + k_y²)^{1/2}. As Fourier transforms are used to map between reciprocal and real space, propagator and transmission functions are band limited to decrease aliasing. Projected atomic potentials are calculated using Kirkland's parameterization [36], where the projected potential of an atom at position, p, in a thin slice is

V_p(r_p) = 4π² r_Bohr e Σ_{i=1}^{n} a_i K_0(2π r_p b_i^{1/2}) + 2π² r_Bohr e Σ_{i=1}^{n} (c_i/d_i) exp(−π² r_p²/d_i), (20)

where r_p = [(x − x_p)² + (y − y_p)²]^{1/2}, x_p and y_p are the coordinates of the atom, r_Bohr is the Bohr radius, K_0 is the modified Bessel function [39], and the parameters a_i, b_i, c_i, and d_i are tabulated for each atom in [36]. Nominally, n = 3. However, we also use n = 1 to investigate robustness to alternative simulation physics. In effect, simulations with n = 1 are for an alternative universe where atoms have different potentials. Every atom in a slice contributes to the total projected potential,

V_z(x, y) = Σ_p V_p(r_p). (21)

After simulation, a 320×320 region was selected from the center of each wavefunction to remove edge artefacts. Each wavefunction was divided by its magnitude to prevent an ANN from inferring information from an absolute intensity scale. In practice, it is possible to measure an absolute scale; however, it is specific to a microscope and its configuration.

To investigate ANN performance for multiple materials, we partitioned 12789 CIFs into training, validation, and test sets by journal of publication. There are 8639 training set CIFs: 150 New Journal of Chemistry, 1034 American Mineralogist, 1998 Journal of the American Chemical Society, and 5457 Inorganic Chemistry. In addition, there are 1216 validation set CIFs published in Physics and Chemistry of Materials, and 2927 test set CIFs published in Chemistry of Materials. Wavefunctions were simulated for three random sets of hyperparameters for each CIF, except for a small portion of examples that were discarded because CIF format or simulation hyperparameters were unsupported. Partitioning by journal helps to test the ability of an ANN to generalize given that wavefunction characteristics are expected to vary with journal. New simulated wavefunction datasets are tabulated in table 1 and have been made publicly available at [26]. In total, 76826 wavefunctions have been simulated for multiple materials. To investigate ANN performance as the distribution of possible wavefunctions is restricted, we also simulated 11870 wavefunctions with smaller simulation hyperparameter upper bounds that reduce ranges by factors close to 1/4. In addition, we simulated 9644 wavefunctions for a randomly selected single material, In1.7K2Se8Sn2.28 [40], shown in fig. 2.

Table 1: New datasets containing 98340 wavefunctions simulated with clTEM are split into training, unseen, validation, and test sets. Unseen wavefunctions are simulated for training set materials with different simulation hyperparameters. Kirkland potential summations were calculated with n = 3 or truncated to n = 1 terms, and dashes (-) indicate subsets that have not been simulated. Datasets have been made publicly available at [26].

Dataset                         n    Train    Unseen    Validation    Test    Total
Multiple Materials              1    25325    1501      3569          8563    38958
Multiple Materials              3    24530    1544      3399          8395    37868
Multiple Materials, Restricted  3    8002     -         1105
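The n = 1 truncation discussed next simply drops the higher-order terms of the sums in eqn. 20. A minimal sketch of this truncation follows; the coefficients are placeholders rather than Kirkland's tabulated values, the physical prefactors default to unity, and the function name is an illustrative assumption.

```python
import numpy as np
from scipy.special import k0  # modified Bessel function K_0

def projected_potential(r, coeffs, n_terms=3, r_bohr=1.0, e=1.0):
    """Kirkland-style projected potential of one atom at in-plane radius r (eqn 20).

    coeffs: dict of arrays 'a', 'b', 'c', 'd' (placeholders here; Kirkland tabulates
    them per element). n_terms=1 reproduces the truncated 'alternative physics'.
    r_bohr and e default to 1 so that units are purely illustrative.
    """
    a, b, c, d = (np.asarray(coeffs[key])[:n_terms] for key in ("a", "b", "c", "d"))
    r = np.atleast_1d(r)[..., None]  # broadcast radii against the i = 1..n terms
    bessel_part = 4 * np.pi**2 * r_bohr * e * np.sum(a * k0(2 * np.pi * r * np.sqrt(b)), axis=-1)
    gauss_part = 2 * np.pi**2 * r_bohr * e * np.sum((c / d) * np.exp(-np.pi**2 * r**2 / d), axis=-1)
    return bessel_part + gauss_part

# Placeholder coefficients purely for illustration (not a real element):
coeffs = {"a": [0.1, 0.05, 0.02], "b": [1.0, 5.0, 20.0], "c": [0.2, 0.1, 0.05], "d": [0.5, 2.0, 8.0]}
full = projected_potential(0.05, coeffs, n_terms=3)       # nominal physics
truncated = projected_potential(0.05, coeffs, n_terms=1)  # alternative universe physics
```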
Datasets were simulated for Kirkland potential summations in eqn. 20 to n = 3, or truncated to n = 1 terms. Truncating summations allows alternative simulation physics to be investigated.

To reconstruct an exit wavefunction, ψ_exit, from its amplitude, A, an ANN must recover missing phase information, θ. However, θ ∈ (−∞, ∞), and restricting phase support to one period of the phase is complicated by cyclic periodicity. Instead, it is convenient to predict a periodic function of the phase with finite support. We use two output channels in fig. 3 to predict phase components, cos θ and sin θ, where ψ = A(cos θ + i sin θ). Each convolutional layer [41, 42] is followed by batch normalization [43], then activation, except the last layer, where no activation is applied. Convolutional layers in residual blocks [44] are ReLU [45] activated, whereas slope 0.1 leaky ReLU [46] activation is used after other convolutional layers to avoid dying ReLUs [47, 48, 49]. In addition, channelwise L2 normalization imposes the identity |exp(iθ)| ≡ 1 after the final convolutional layer. In initial experiments, batch normalization was frozen halfway through training, similar to [50]. However, scale invariance before L2 normalization resulted in numerical instability. As a result, we updated batch normalization parameters throughout training. Adding a secondary objective to impose a single output scale, such as a distance between mean L2 norms and unity, slowed training. Nevertheless, L2 normalization can be removed for generators that converge to low errors if |exp(iθ)| ≡ 1 is implicitly imposed by their loss functions.

For direct prediction, generators were trained by ADAM-optimized [51] stochastic gradient descent [52, 53] for i_max = 5×10⁵ iterations to minimize adaptive learning rate clipped [54] (ALRC) mean squared errors (MSEs) of phase components. Training losses were calculated by multiplying MSEs by 10, and ALRC layers were initialized with first raw moment µ1 = 25, second raw moment µ2 = 30, exponential decay rates β1 = β2 = 0.999, and n = 3 standard deviations. We used an initial learning rate η0 = 0.002, which was stepwise exponentially decayed [55] by a factor of 0.5 every i_max/7 iterations, and a first moment of the momentum decay rate, β1 = 0.9.

In practice, wavefunctions with similar amplitudes may make output phase components ambiguous. As a result, an MSE-trained generator may predict a weighted mean of multiple probable phase outputs, even if it understands that one pair of phase components is more likely. To overcome this limitation, we propose training a generative adversarial network [56] (GAN) to predict most probable outputs. Specifically, we propose training a discriminator, D, in fig. 4 (which predicts whether wavefunction components were generated by a neural network) for a function, f, of amplitudes, and real and generated output phase components. This will enable an adversarial generator, G, to learn to output realistic phases in the context of their amplitudes. There are many popular GAN loss functions and regularization mechanisms [57, 58]. Following [59], we use mean squared generator, L_G, and discriminator, L_D, losses, and apply spectral normalization to the weights of every convolutional layer in the discriminator,

L_D = (D(f(ψ)) − 1)² + D(f(G(|ψ|)))², (22)

L_G = (D(f(G(|ψ|))) − 1)², (23)

where f is a function that parameterizes ψ as the channelwise concatenation of {A cos θ, A sin θ}.
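A minimal TensorFlow-style sketch of the two-channel phase head with channelwise L2 normalization and the least-squares losses in eqns 22–23 follows. Spectral normalization of the discriminator weights is omitted, and the layer sizes, helper functions and names are illustrative assumptions rather than the architecture used in the paper.

```python
import tensorflow as tf

def phase_head(features):
    """Map generator features to phase components (cos(theta), sin(theta)).

    Channelwise L2 normalization imposes |exp(i*theta)| = 1 on the two output channels.
    """
    components = tf.keras.layers.Conv2D(2, 3, padding="same")(features)  # final layer, no activation
    return tf.math.l2_normalize(components, axis=-1)  # unit norm across the two channels

def parameterize(amplitude, cos_theta, sin_theta):
    """f: channelwise concatenation of amplitude-weighted components, {A cos(theta), A sin(theta)}."""
    return tf.concat([amplitude * cos_theta, amplitude * sin_theta], axis=-1)

def gan_losses(discriminator, real_parameterized, amplitude, generated_components):
    """Least-squares GAN losses of eqns 22-23 (real_parameterized is f(psi) for a real wavefunction)."""
    cos_g, sin_g = tf.split(generated_components, 2, axis=-1)
    fake_parameterized = parameterize(amplitude, cos_g, sin_g)
    d_real = discriminator(real_parameterized)   # D(f(psi))
    d_fake = discriminator(fake_parameterized)   # D(f(G(|psi|)))
    loss_d = tf.reduce_mean((d_real - 1.0) ** 2 + d_fake ** 2)  # eqn 22
    loss_g = tf.reduce_mean((d_fake - 1.0) ** 2)                # eqn 23
    return loss_d, loss_g
```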
Multiplying generated phase components by inputted A conditions wavefunction discrimination on A, ensuring that the generator learns to output physically probable θ. Other parameterizations, such as the channelwise concatenation of {A, cos θ, sin θ}, could also be used. There are no biases in the discriminator. Concatenation of conditional information to discriminator inputs and feature channels is investigated in [60, 61, 62, 63, 64, 65, 66, 67]. Projection discriminators, which calculate inner products of generator outputs and conditional embeddings, are an alternative that achieve higher performance in [68]. However, blind compression to an embedded representation would reduce wavefunction information, potentially limiting the quality of generated wavefunctions, and may encourage catastrophic forgetting [69]. Both generator and discriminator training were ADAM optimized for 5×10⁵ iterations with base learning rates η_G = η_D = 0.0002, and first moment of the momentum decay, β1 = 0.5. To balance generator and discriminator learning, we adapt the discriminator learning rate based on µ_D, the running mean discrimination of generated wavefunctions, D(f(G(|ψ|))), tracked by an exponential moving average with a decay rate of 0.99; constants m = 20 and c = 0.5 linearly transform µ_D.

To augment training data, we selected random w×w crops from 320×320 wavefunctions. Each crop was then subject to a random combination of flips and π/2 rad rotations to augment our datasets by a factor of eight. We chose wavefunction size w = 224 for direct prediction and w = 144 for GANs, where w is smaller for GANs as discriminators add to GPU memory requirements. ANNs were trained with a batch size of 24.

In this section, we investigate phase recovery with ANNs as the distribution of wavefunctions is restricted. To directly predict θ for A, we trained ANNs for multiple materials, multiple materials with restricted simulation hyperparameters, and In1.7K2Se8Sn2.28. We also trained a GAN for In1.7K2Se8Sn2.28 wavefunctions. Experiments are repeated with the summation in eqn. 20 truncated from n = 3 to n = 1, to demonstrate robustness to simulation physics. Distributions of generated phase component mean absolute errors (MAEs) for sets of 19992 validation examples are shown in fig. 5, and moments are tabulated in table 2. We used up to three validation sets, which cumulatively quantify the ability of a network to generalize to unseen transforms (combinations of flips, rotations and translations), simulation hyperparameters (such as thickness and voltage), and materials. In comparison, the expected nth moment of the error of a phase component, |x − g(θ)|, where g ∈ {cos, sin}, for uniform random predictions, x ∼ U(−1, 1), and uniformly distributed phases, θ ∼ U(−π, π), is

E[|x − g(θ)|^n] = ∫_{−1}^{1} ∫_{−π}^{π} ρ(x) ρ(θ) |x − g(θ)|^n dθ dx, (25)

where ρ(θ) = 1/(2π) and ρ(x) = 1/2 are uniform probability density functions for θ and x, respectively.

Figure 5: Frequency distributions show 19992 validation set mean absolute errors for neural networks trained to reconstruct wavefunctions simulated for multiple materials, multiple materials with restricted simulation hyperparameters, and In1.7K2Se8Sn2.28. Networks for In1.7K2Se8Sn2.28 were trained to predict phase components directly, by minimising squared errors, and as part of generative adversarial networks. To demonstrate robustness to simulation physics, some validation set errors are shown for n = 1 and n = 3 simulation physics.
We used up to three validation sets, which cumulatively quantify the ability of a network to generalize to unseen transforms (combinations of flips, rotations and translations), simulation hyperparameters (such as thickness and voltage), and materials. A vertical dashed line indicates an expected error of 0.75 for random phases, and frequencies are distributed across 100 bins.

Table 2: Means and standard deviations of 19992 validation set errors for unseen transforms (trans.), simulation hyperparameters (param.) and materials (mater.). All networks outperform a baseline uniform random phase generator for both n = 1 and n = 3 simulation physics. Dashes (-) indicate that validation set wavefunctions have not been simulated.

The first two moments are E[|x − g(θ)|] = 3/4 and E[|x − g(θ)|²] = 5/6, making the expected standard deviation 0.520. All ANN MAEs have lower means and standard deviations than a baseline random phase generator, except an In1.7K2Se8Sn2.28 generator applied to other materials. ANNs do not have prior understanding of propagation equations or dynamics. As a result, experiments demonstrate that ANNs are able to develop and leverage a physical understanding to recover θ. ANNs are trained for Kirkland potential summations in eqn. 20 to n = 3 and n = 1 terms, demonstrating a robustness to simulation physics. Success with different simulation physics motivates the development of ANNs for real physics, which is approximated by n = 3 simulation physics.

Validation set MAEs increase as wavefunction restrictions are cumulatively reduced from unseen transforms used for data augmentation during training, to unseen simulation hyperparameters, and unseen materials. For example, MAEs are 0.600 and 0.614 for ANNs trained for multiple materials, increasing to 0.708 and 0.768 for ANNs trained for In1.7K2Se8Sn2.28. This shows that MAEs increase for materials an ANN is unfamiliar with, approaching the MAE of 0.75 expected for a uniform random phase generator where there is no familiarity. Wavefunctions are insufficiently restricted for multiple materials. Validation MAEs of 0.333 and 0.513 for unseen transforms diverge to 0.600 and 0.614 for unseen simulation hyperparameters and materials. In addition, a peak near 0.15 decreases, and MAE density around 0.75 increases. Taken together, this indicates that multiple material ANNs are able to recognise and generalize to some wavefunctions; however, their ability to generalize is limited. Further, frequency distribution tails exceed 0.75 for all validation sets. This may indicate that the generator struggles with material and simulation hyperparameter combinations that produce wavefunctions with unusual characteristics. However, we believe the tail is mainly caused by combinations that produce different wavefunctions with similar amplitudes. Validation divergence decreases as the distribution of wavefunctions is restricted. For example, frequency distributions have almost no tail beyond 0.75 for simulation hyperparameter ranges reduced by factors close to 1/4. Validation divergence is also reduced by training for In1.7K2Se8Sn2.28, a single material. Restricting the distribution of wavefunctions is an essential part of one-shot wavefunction reconstruction, otherwise there is an infinite number of possible θ for a given A. To investigate an approach that discourages predicting a weighted mean of multiple probable θ for a given A, we trained GANs for In1.7K2Se8Sn2.28.
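Before turning to the GAN results, note that the random-phase baseline quoted above can be checked numerically; a minimal Monte Carlo sketch of eqn. 25 (assuming NumPy) reproduces the first two moments, 3/4 and 5/6, and the 0.520 standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000_000

x = rng.uniform(-1.0, 1.0, n_samples)          # uniform random phase-component predictions
theta = rng.uniform(-np.pi, np.pi, n_samples)  # uniformly distributed phases
errors = np.abs(x - np.cos(theta))             # g = cos; g = sin gives the same moments

print(errors.mean())         # ~0.750 = 3/4 (first moment of eqn 25)
print((errors ** 2).mean())  # ~0.833 = 5/6 (second moment of eqn 25)
print(errors.std())          # ~0.520, the expected standard deviation quoted in table 2
```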
Training as part of a GAN acts as a regularization mechanism, lowering validation divergence. However, a GAN requires a powerful discriminator to understand the distribution of possible wavefunctions and can be difficult to train. In particular, n = 3 wavefunctions have lower local spatial correlation than n = 1 wavefunctions at our simulation resolution, which made it more difficult for our n = 3 GAN to learn. Training loss distributions have tails with high losses. As a result, we used ALRC to limit high errors. A comparison of training with and without ALRC is in fig. 6. Validation MAEs for unseen materials have mean 0.600 and standard deviation 0.334 with ALRC, and mean 0.602 and standard deviation 0.338 without ALRC. Differences between validation MAEs are insignificant, so ALRC does not help for training with a batch size of 24. This behavior is in line with results in the ALRC paper [54], which shows that ALRC becomes less effective as batch size increases. Nevertheless, ALRC may help to lower errors if generators are trained with smaller batch sizes, particularly if the wavefunction distribution is restricted so that errors are low, removing the need for L2 normalization at the end of the generator and therefore decreasing dependence on batch normalization.

Examples of ANN phase recovery are shown in fig. 7 alongside crystal structures highlighting the structural information producing exit wavefunctions. Results are for unseen materials and an ANN trained for multiple materials with restricted simulation hyperparameters.

Figure 7: Exit wavefunction reconstruction for unseen NaCl, B3BeLaO7, PbZr0.45Ti0.55O3, CdTe, and Si input amplitudes, and corresponding crystal structures. Phases in [−π, π) rad are depicted on a linear greyscale from black to white, and show that output phases are close to true phases. Wavefunctions are cyclically periodic functions of phase, so distances between black and white pixels are small. Si is a failure case where phase information is not accurately recovered. Miller indices label projection directions.

Wavefunctions are presented for NaCl [70] and elemental Si as they are simple materials with widely recognised structures. Other materials belong to classes that are widely investigated: B3BeLaO7 [71] is a non-linear optical crystal, PbZr0.45Ti0.55O3 [72] is a ferroelectric used in ultrasonic transducers [73] and ceramic capacitors [74], and CdTe is a semiconductor used in solar cells [75]. The Si example is also included as a typical failure case for unfamiliar examples, possibly because the Si crystal structure is unusually simple. Additional sheets of example input amplitudes, generated phases, and true phases for each ANN will be provided as supplementary information with the published version of this preprint.

This paper describes an initial investigation into CTEM one-shot exit wavefunction reconstruction with deep learning, and is intended to be a starting point for future research. We expect that ANN architecture and learning policy can be substantially improved, possibly with AdaNet [76], Ludwig [77], or other automatic machine learning [78] algorithms, and we encourage further investigation. In this spirit, all of our source code [79] (based on TensorFlow [80]), clTEM simulation software [27], and new wavefunction datasets [26] have been made publicly available. Training for each network was stopped after a few days on an Nvidia 1080 Ti GPU, and losses were still decreasing.
As a result, this paper presents lower bounds for performance. To demonstrate robustness to simulation physics, Kirkland potential summations in eqn. 20 were calculated with n = 3, or truncated to n = 1 terms, for different datasets. For further simulations, compiled clTEM versions with n = 1 and n = 3 have been included in our project repository [79]. Source code for clTEM is also available with separate pre-releases [27]. Summations with n = 3 approximate experimental physics, whereas n = 1 is for an alternative universe with different atomic potentials. Our experiments do not include aberrations or detector noise. This restricts the distribution of wavefunctions and makes it easier for ANNs to learn. However, distributions of wavefunctions were less restricted than possible in practice, and ANNs can remove noise [81]. As a result, we expect one-shot exit wavefunction reconstruction to be applicable to experimental images. A good starting point for future research may be materials where the distribution of wavefunctions is naturally restricted, for example, graphene [82] and other two-dimensional materials [83], select crystals at atomic resolution [84], or classified images, such as biological specimens [85, 86] after similar preparation.

Information about materials, expected ranges of simulation hyperparameters, and other metadata was not input to ANNs. However, this variable information is readily available and could restrict the distribution of wavefunctions, improving ANN performance. Accordingly, we suggest that metadata embedded by an ANN could be used to modulate information transfer through a convolutional neural network by conditional batch normalization [87]. However, metadata is typically high-dimensional, so this may be impractical beyond individual applications. By default, large amounts of metadata are saved to Digital Micrograph image files (e.g. dm3 and dm4) created by Gatan Microscopy Suite [88] software. Metadata can also be saved to TIFFs [89] or other image formats preferred by electron microscopists using different software. In practice, most of this metadata describes microscope settings, such as voltage and magnification, and may not be sufficient to restrict the distribution of wavefunctions. Nevertheless, most file formats support the addition of extra metadata that is readily known to experimenters. Example information may include estimates for stoichiometry, specimen thickness, zone axis, temperature, the microscope and its likely aberration range, and phenomena exhibited by materials in scientific literature. ANNs have been developed to embed scientific literature [90], so we expect that it will become possible to include additional metadata as a lay description.

In this paper, ANNs are trained to reconstruct ψ from A, and therefore follow a history of successful deep learning applications to accelerated quantum mechanics [91, 92]. In contrast, experimental holograms are integrated over detector supports. Although probability density, |ψ(S)|², at the mean support, S̄, can be factored outside the integral of eqn. 2 if spatial variation is small, ∇χ → 0, and S is effectively invariant, these restrictions are unrealistic. In practice, we do not think the distinction is important as ANNs have learned to recover optical θ from I [23]. To discourage ANNs from gaming their loss functions by predicting an average of probable phase components, we propose training GANs. However, GANs are difficult to train [93, 69], and GAN training can take longer than with MSEs.
For example, our validation set GAN MAEs are lower than for MSE training after 5×10⁵ iterations. We also found that GAN performance can be much lower for some wavefunctions, such as those with low local spatial correlation. High performance for large wavefunctions also requires powerful discriminators, such as [94], to understand their distribution. Overall, we expect GANs to become less useful the more a distribution of wavefunctions is restricted. As the distribution becomes more restricted, a smaller portion of the distribution has similar amplitudes with substantially different phases. In part, we expect this effect already lowers MAEs as distributions are restricted. Another contribution is restricted physics, which makes networks less reliant on identifying features. As a result, we expect the main use of GANs in phase recovery to be improving wavefunction realism.

We have simulated five new datasets containing 98340 CTEM exit wavefunctions with clTEM. The datasets have been used to train ANNs to reconstruct wavefunctions from single images. In this initial investigation, we found that ANN performance improves as the distribution of wavefunctions is restricted. One-shot exit wavefunction reconstruction overcomes the limitations of aberration series reconstruction and holography: it is live, does not require experimental equipment, and can be applied as a post-processing step indefinitely after an image is taken. We expect our results to be generalizable to other types of electron microscopy. This work is intended to establish starting points to be improved on by future research. In this spirit, our new datasets [26], clTEM simulation software [27], and source code with links to pre-trained models [79] have been made publicly available. In appendices, we build on Abbe's theory of wave optics to propose a new approach to phase recovery with deep learning. The idea is that wavefunctions could be learned from large datasets of single images, avoiding the difficulty and expense of collecting experimental wavefunctions. Nevertheless, we also introduce a new dataset containing 1000 512×512 experimental focal series. In addition, a supplementary document will be provided with the published version of this preprint with sheets of example input amplitudes, output phases, and true phases for every ANN featured in this paper.

Fig. S18: GAN input amplitudes, target phases and output phases of 144×144 In1.7K2Se8Sn2.28 validation set wavefunctions for unseen simulation hyperparameters, and n = 3 simulation physics.

This chapter covers our paper titled "Exit Wavefunction Reconstruction from Single Transmission Electron Micrographs with Deep Learning" 7 and associated research outputs 15, 25. At the University of Warwick, EWR is usually based on iterative focal and tilt series reconstruction (FTSR), so a previous PhD student, Mark Dyson, GPU-accelerated FTSR 252. However, both recording a series of electron micrographs and FTSR usually take several seconds, so FTSR is unsuitable for live EWR. We have an electrostatic biprism that can be used for live in-line holography 253-255; however, it is not used as we find that in-line holography is more difficult than FTSR. In addition, in-line holography can require expensive microscope modification if a microscope is not already equipped for it.
Thus, I was inspired by applications of DNNs to predict missing information for low-light vision 256, 257 to investigate live application of DNNs to predict missing phases of exit wavefunctions from single TEM images. A couple of years ago, it was shown that DNNs can recover phases of exit wavefunctions from single optical micrographs if wavefunctions are constrained by limiting input variety 258-260. Similarly, electron propagation can be described by wave optics 261, and optical and electron microscopes have similar arrangements of optical and electromagnetic lenses, respectively 262. Thus, it might be expected that DNNs can recover phases of exit wavefunctions from single TEM images. However, we were unaware of the earlier experiments with optical micrographs when we started our investigation. Thus, whether DNNs could reconstruct phase information from single TEM images was contentious, as there are infinite possible phases for a given amplitude. Further, previous non-iterative approaches to TEM EWR were limited to defocused images in the Fresnel regime 263 or non-planar incident wavefunctions in the Fraunhofer regime 264.

We were not aware of any large openly accessible datasets containing experimental TEM exit wavefunctions. Consequently, we simulated exit wavefunctions with clTEM 252, 265 for a preliminary investigation. Similar to optical EWR 258-260, we found that DNNs can recover the phases of TEM exit wavefunctions if wavefunction variety is restricted. Limitingly, our simulations are unrealistic insofar as they do not include aberrations, specimen drift, statistical noise, and higher-order simulation physics. However, we have demonstrated that DNNs can learn to remove noise 6 (ch. 6), specimen drift can be reduced by sample holders 266, and aberrations can be minimized by aberration correctors 261, 267-269. Moreover, our results present lower bounds for performance as our inputs were far less restricted than possible in practice.

Curating a dataset of experimental exit wavefunctions to train DNNs to recover their phases is time-consuming and expensive. Further, data curation became impractical due to a COVID-19 national lockdown in the United Kingdom 196. Instead, we propose a new approach to EWR that uses metadata to inform DNN training with single images. Our TEM (ch. 6) and STEM (ch. 4) images in WEMD 2 are provided as a possible resource to investigate our proposal. However, metadata is not included in WEMD, which is problematic as performance is expected to increase with the amount of metadata available, since additional metadata further restricts probable exit wavefunctions. Nevertheless, DNNs can reconstruct some metadata from unlabelled electron micrographs 270. Another issue is that the experimental images in WEMD cover a range of electron microscope configurations, which would complicate DNN training. For example, experimental TEM images include bright field, dark field, diffraction and CBED images. However, data clustering could be applied to partially automate labelling of electron microscope configurations. For example, I provide pretrained VAEs to embed images for tSNE 2 (ch. 2).

This thesis covers a subset of my papers on advances in electron microscopy with deep learning. My review paper (ch. 1) offers a substantial introduction that sets my work in context. Ancillary chapters then introduce new machine learning datasets for electron microscopy (ch.
2) and an algorithm to prevent learning instability when training large neural networks with limited computational resources (ch. 3). Finally, we report applications of deep learning to compressed sensing in STEM with static (ch. 4) and dynamic (ch. 5) scans, improving TEM signal-to-noise (ch. 6), and TEM exit wavefunction reconstruction (ch. 7). This thesis therefore presents a substantial original contribution to knowledge which is, in practice, worthy of peer-reviewed publication. This thesis adds to my existing papers by presenting their relationships, reflections, and holistic conclusions. To encourage further investigation, source code, pretrained models, datasets, and other research outputs associated with this thesis are openly accessible.

Experiments presented in this thesis are based on unlabelled electron microscopy image data. Thus, this thesis demonstrates that large machine learning datasets can be valuable without needing to add enhancements, such as image-level or pixel-level labels, to data. Indeed, this thesis can be characterized as an investigation into applications of large unlabelled electron microscopy datasets. However, I expect that tSNE clustering based on my pretrained VAE encodings 2 (ch. 2) could ease image-level labelling for future investigations. Most areas of science are facing a reproducibility crisis 115, including artificial intelligence 271, which I think is partly due to a perceived lack of value in archiving data that has not been enhanced. However, this thesis demonstrates that unlabelled data can readily enable new applications of deep learning in electron microscopy. Thus, I hope that my research will encourage more extensive data archiving by the electron microscopy community.

My DNNs were developed with TensorFlow 272, 273 and Python. In addition, recent versions of Gatan Microscopy Suite (GMS) software 274, which is often used to drive electron microscopes, support Python 275. Thus, my pretrained models and source code can be readily integrated into existing GMS software. If a microscope is operated by alternative software or an older version of GMS that does not support Python, TensorFlow supports many other programming languages 1 which can also interface with my pretrained models, and which may be more readily integrated. Alternatively, Python code can often be readily embedded in or executed by other programming languages. To be clear, my DNNs were developed as part of an initial investigation of deep learning in electron microscopy. Thus, this thesis presents lower bounds for performance that may be improved upon by refining ANN architecture and learning policy. Nevertheless, my pretrained models can be the initial basis of deep learning software for electron microscopy.

This thesis includes a variety of experiments to refine ANN architecture and learning policy. As AutoML 245-249 has improved since the start of my PhD, I expect that human involvement can be reduced in future investigations of standard architecture and learning policy variations. However, AutoML is yet to be able to routinely develop new approaches to machine learning, such as VAE encoding normalization and regularization 2 (ch. 2) and ALRC 3 (ch. 3). Most machine learning experts do not think that a technological singularity, where machines outright surpass human developers, is likely for at least a couple of decades 276.
Nonetheless, our increasingly creative machines are already automating some aspects of software development 277, 278 and can programmatically describe ANNs 279. Accordingly, I encourage adoption of creative software, like AutoML, to ease development. Perhaps the most exciting aspect of ANNs is their scalability 280, 281. Once an ANN has been trained, clones of the ANN and supporting software can be deployed on many electron microscopes at little or no additional cost to the developer. All machine learning software comes with technical debt 282, 283; however, software maintenance costs are usually far lower than the cost of electron microscopes. Thus, machine learning may be a promising means to cheaply enhance electron microscopes. As an example, my experiments indicate that compressed sensing ANNs 4 (ch. 4) can increase STEM and other electron microscopy resolution by up to 10× with minimal information loss. Such a resolution increase could greatly reduce the cost of electron microscopes while maintaining similar capability. Further, I anticipate that multiple ANNs offering a variety of functionality can be combined into a single- or multiple-ANN system that simultaneously offers a variety of enhancements, including increased resolution, decreased noise 6 (ch. 6), and phase information 7 (ch. 7).

I think the main limitation of this thesis, and deep learning, is that it is difficult to fairly compare different approaches to DNN development. As an example, I found that STEM compressed sensing with regularly spaced scans outperforms contiguous scans for the same ANN architecture and learning policy 4 (ch. 4). However, such a performance comparison is complicated by sensitivity of performance to training data, architecture, and learning policy. As a case in point, I argued that contiguous scans could outperform spiral scans if STEM images were not oversampled 4, which could be the case if partial STEM ANNs are also trained to increase image resolution. In part, I think ANN development is an art: most ANN architecture and learning policy is guided by heuristics, and the best approaches to maximize performance are chosen by natural selection 284. Due to the complicated nature of most data, the maximum performances that can be achieved with deep learning are not known. However, it follows from the universal approximator theorem 233-241 that minimum errors can, in principle, be achieved by DNNs.

Applying an ANN to a full image usually requires less computation than applying an ANN to multiple image crops. Processing full images avoids repeated calculations if crops overlap 6 (ch. 6) or lower performance near crop edges where there is less information 4, 6, 19 (ch. 4 and ch. 6). However, it is usually impractical to train large DNNs to process full electron microscopy images, which are often 1024×1024 or larger, due to limited memory in most GPUs. This was problematic as one of my original agreements about my research was that I would demonstrate that DNNs could be applied to large electron microscopy images, which Richard Beanland and I decided were at least

Overt examples include predicting unknown pixels for compressed sensing with static 4 (ch. 4) or adaptive 5 (ch. 5) sparse scans, and unknown phase information from image intensities 7 (ch. 7). More subtly, improving image signal-to-noise with a DNN 6 (ch. 6) is akin to improving signal-to-noise by increasing numbers of intensity measurements. Arguably, even search engines based on VAEs 2 (ch.
2) add information to images insofar that VAE 241 encodings can be compared to quantify semantic similarities between images. Ultimately, my DNNs add information to data that could already be understood from physical laws and observations. However, high-dimensional datasets can be difficult to utilize. Deep learning offers an effective and timely means to both understand high-dimensional data and leverage that understanding to produce results in a useable format. Thus, I both anticipate and encourage further investigation of deep learning in electron microscopy. 242 There's Plenty of Room at the Top: What Will Drive Computer Performance After Moore's Law? Revisiting Unreasonable Effectiveness of Data in Deep Learning Era Review of Deep Learning Algorithms and Architectures A Survey of Deep Learning and Its Applications: A New Paradigm to Machine Learning A State-of-the-Art Survey on Deep Learning Theory and Architectures A Survey on Deep Learning for Big Data A Survey of Deep Learning: Platforms, Applications and Emerging Research Trends Deep Learning Deep Learning in Neural Networks: An Overview Machine Learning and the Physical Sciences On the Use of Deep Learning for Computational Imaging From DFT to Machine Learning: Recent Approaches to Materials Science -A Review Introducing Machine Learning: Science and Technology The Deep Learning Revolution The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches Deep Neural Networks are More Accurate than Humans at Detecting Sexual Orientation from Facial Images Deep Networks can Resemble Human Feed-Forward Vision in Invariant Object Recognition Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification Surpassing Human-Level Face Verification Performance on LFW with GaussianFace AlphaStar: Mastering the Real-Time Strategy Game StarCraft II Beating the World's Best at Super Smash Bros Playing FPS Games with Deep Reinforcement Learning Mastering the Game of Go with Deep Neural Networks and Tree Search Revisiting Generalization for Deep Learning: PAC-Bayes, Flat Minima, and Generative Models Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks Understanding Training and Generalization in Exploring Generalization in Deep Learning Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes Deep Learning Discovering Physical Concepts with Neural Networks Toward an Artificial Intelligence Physicist for Unsupervised Learning A Survey of Accelerator Architectures for Hardware Architectures for the Fast Fourier Transform Discrete Fourier Transform Computation Using Neural Networks The FFT on a GPU Newton Versus the Machine: Solving the Chaotic Three-Body Problem Using Deep Neural Networks Deep Learning and Density-Functional Theory Deep Neural Network Computes Electron Densities and Energies of a Large Set of Organic Molecules Faster than Density Functional Theory (DFT) Fast Phase Retrieval in Off-Axis Digital Holographic Microscopy Through Deep Learning Improving Electron Micrograph Signal-to-Noise with an Atrous Convolutional Encoder-Decoder ImageNet Classification with Deep Convolutional Neural Networks Improving Electron Micrograph Signal-to-Noise with an Atrous Convolutional Encoder-Decoder Overview of Image Denoising Based on Deep Learning Deep Learning on Image Denoising: An Overview Deep Learning-Based Electrocardiogram Signal Noise Detection and Screening Model Deep Recurrent Neural Networks for ECG Signal Denoising Probabilistic 
Self-Learning Framework for Low-Dose CT Denoising Medical Image Denoising Using Convolutional Neural Network: A Residual Learning Approach Speckle Noise Removal in Ultrasound Images Using a Deep Convolutional Neural Network and a Specially Designed Loss Function Deep-Learning-Based Image Reconstruction and Enhancement in Optical Microscopy Denoising of Stimulated Raman Scattering Microscopy Images via Deep Learning A Deep Learning Approach to Denoise Optical Coherence Tomography Images of the Optic Nerve Head Cycle-Consistent Deep Learning Approach to Coherent Noise Reduction in Optical Diffraction tomography A Review of Multi-Objective Deep Learning Speech Denoising Methods Phase-Aware Single-Stage Speech Denoising and Dereverberation with U-Net Learning Spectral Mapping for Speech Dereverberation and Denoising Image Denoising Review: From Classical to State-of-the-Art Approaches Brief Review of Image Denoising Techniques Investigation on the Effect of a Gaussian Blur in Image Filtering and Segmentation An Adaptive Gaussian Filter for Noise Reduction and Edge Detection An Automatic Parameter Decision System of Bilateral Filtering with GPU-Based Acceleration for Brain MR Images Image Denoising Using Optimally Weighted Bilateral Filters: A Sure and Fast Approach Adaptive-Weighted Bilateral Filtering and Other Pre-Processing Techniques for Optical Coherence Tomography Bilateral Filtering for Gray and Color Images An Efficient Image Denoising Scheme for Higher Noise Levels Using Spatial Domain Filters Anisotropic Diffusion Filter with Low Arithmetic Complexity for Images Scale-Space and Edge Detection Using Anisotropic Diffusion Progressive Switching Median Filter for the Removal of Impulse Noise from Highly Corrupted Images Optimal Weighted Median Filtering Under Structural Constraints Wiener Filter Reloaded: Fast Signal Reconstruction Without Preconditioning Efficient Wiener Filtering Without Preconditioning Principles of Digital Wiener Filtering An Iterative Wavelet Threshold for Signal Denoising Image De-Noising Using Discrete Wavelet Transform A New SURE Approach to Image Denoising: Interscale Orthonormal Wavelet Thresholding. 
IEEE Transactions on Image Process Empirical Bayes Approach to Improve Wavelet Thresholding for Image Noise Reduction Adaptive Wavelet Thresholding for Image Denoising and Compression Ideal Spatial Adaptation by Wavelet Shrinkage The Curvelet Transform The Curvelet Transform for Image Denoising Nonparametric Denoising Methods Based on Contourlet Transform with Sharp Frequency Localization: Application to Low Exposure Time Electron Microscopy Images The Contourlet Transform: An Efficient Directional Multiresolution Image Representation Wavelet Packet Based CT Image Denoising Using Bilateral Method and Bayes Shrinkage Rule Hybrid Method for Medical Image Denoising Using Shearlet Transform and Bilateral Filter Image De-Noising by Using Median Filter and Weiner Filter Spatial and Temporal Bilateral Filter for Infrared Small Target Enhancement Dual-Domain Image Denoising BM3D Frames and Variational Image Deblurring Image Denoising by Sparse 3-D Transform-Domain Collaborative Filtering Image Denoising via Sparse Representation Over Grouped Dictionaries with Adaptive Atom Size From Heuristic Optimization to Dictionary Learning: A Review and Comprehensive Comparison of Image Denoising Algorithms Clustering-Based Denoising with Locally Learned Dictionaries An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation Image Denoising via Sparse and Redundant Representations Over Learned Dictionaries Shot-Noise-Limited Nanomechanical Detection and Radiation Pressure Backaction from an Electron Beam Theoretical Framework of Statistical Noise in Scanning Transmission Electron Microscopy Electron Dose Dependence of Signal-to-Noise Ratio, Atom Contrast and Resolution in Transmission Electron Microscope Images A Statistical Model of Signal-Noise in Scanning Electron Microscopy Effect of Shot Noise and Secondary Emission Noise in Scanning Electron Microscope Images Noise Models in Digital Image Processing Characterisation of the Signal and Noise Transfer of CCD Cameras for Electron Detection Performance of a Low-Noise CCD Camera Adapted to a Transmission Electron Microscope Optics of High-Performance Electron Microscopes Understanding of Scanning-System Distortions of Atomic-Scale Scanning Transmission Electron Microscopy Images for Accurate Lattice Parameter Measurements Dynamic Scan Control in STEM: Spiral Scans Scanning Distortion Correction in STEM Images Correcting Nonlinear Drift Distortion of Scanning Probe and Scanning Transmission Electron Microscopies from Image Pairs with Orthogonal Scan Directions Identifying and Correcting Scan Noise and Drift in the Scanning Transmission Electron Microscope Situ Transmission Electron Microscopy of Electron-Beam Induced Damage Process in Nuclear Grade Graphite An Interactive ImageJ Plugin for Semi-Automated Image Denoising in Electron Microscopy Evaluation of Denoising Algorithms for Biological Electron Tomography Deep Learning Supersampled Scanning Transmission Electron Microscopy A Comparison Study for Image Compression Based on Compressive Sensing An Introduction to Compressed Sensing A Systematic Review of Compressive Sensing: Concepts, Implementations and Applications Compressed Sensing: Theory and Applications Compressed Sensing Improving the Speed of MRI with Artificial Intelligence Compressed Sensing MRI: A Review from Signal Processing Perspective Sparse MRI: The Application of Compressed Sensing for Rapid MR Imaging. Magn. Reson. 
Medicine: An Off Image Compression Based on Compressive Sensing: End-to-end Comparison with JPEG Compressed Sensing for Image Compression: Survey of Algorithms Deep Learning for Image Super-Resolution: A Survey Deep Learning for Single Image Super-Resolution: A Brief Review Low-Dose Abdominal CT Using a Deep Learning-Based Denoising Algorithm: A Comparison with CT Reconstructed with Filtered Back Projection or Iterative Reconstruction Algorithm Deep-Learning-Based Breast CT for Radiation Dose Reduction Adaptive Compressed Tomography Sensing Robust Perceptual Night Vision in Thermal Colorization Learning to See in the Dark The Energy Dependence of Contrast and Damage in Electron Cryomicroscopy of Biological Molecules Radiation Damage in Nanostructured Materials Electron Radiation Damage Mechanisms in 2D MoSe 2 The Effect of Electron Beam Irradiation in Environmental Scanning Transmission Electron Microscopy of Whole Cells in Liquid Dose-Rate-Dependent Damage of Cerium Dioxide in the Scanning Transmission Electron Microscope Characterisation of Radiation Damage by Transmission Electron Microscopy Crystallography Open Database -An Open-Access Collection of Crystal Structures The American Mineralogist Crystal Structure Database Recent Developments in the Inorganic Crystal Structure Database: Theoretical Crystal Structure Data and Related Features The Introduction of Structure Types into the Inorganic Crystal Structure Database ICSD The Inorganic Crystal Structure Database (ICSD) -Present and Future New Developments in the Inorganic Crystal Structure Database (ICSD): Accessibility in Support of Materials Research and Design Crystallographic Databases NIST Crystallographic Databases for Research and Analysis The Kinetics Human Action Video Dataset YouTube-8M: A Large-Scale Video Classification Benchmark Innovative Technologies for Content and Data Curation DeepDicomSort: An Automatic Sorting Algorithm for Brain Magnetic Resonance Imaging Data Medical Data Quality Assessment: On the Development of an Automated Framework for Medical Data Curation ADeX: A Tool for Automatic Curation of Design Decision Knowledge for Architectural Decision recommendations Data curation with deep learning Scaling up Data Curation Using Deep Learning: An application to Literature Triage in Genomic Variation Resources Big Data Curation Software Heritage: Why and How to Preserve Software Source Code Google Cloud Source Repositories Understanding Watchers on GitHub On the Relation Between GitHub Communication Activity and Merge Conflicts A Large Scale Study of Long-Time Contributor Prediction for GitHub Projects Do as I Do, Not as I Say: Do Contribution Guidelines Match the GitHub Contribution Process? More Common than Tou Think: An In-Depth Study of Casual Contributors How GitHub Contributing.md Contributes to Contributors Studying in the 'Bazaar': An Exploratory Study of Crowdsourced Learning in GitHub The Signals that Potential Contributors Look for When Choosing Open-source Projects Open Source Software Hosting Platforms: A Collaborative Perspective's Review Wikipedia Contributors. 
Comparison of source-code-hosting facilities -Wikipedia, the free encyclopedia How are Alexa's Traffic Rankings Determined Invisible Search and Online Search Engines: The Ubiquity of Search in Everyday Life Measuring the Importance of User-Generated Content to Search Engines The Role and Importance of Search Engine and Search Engine Optimization The Anatomy of a Large-Scale Hypertextual Web Search Engine The effect of content-equivalent near-duplicates on the evaluation of search engines The Impact of Google on Discovering Scholarly Information: Managing STM publishers' Visibility in Google. Collect. Curation Retrieval Performance of Google, Yahoo and Bing for Navigational Queries in the Field of Retrieval Performance of Select Search Engines in the Field of Physical Sciences Seek and You Shall Find? A Content Analysis on the Diversity of Five Search Engines' Results on Political Queries Evaluating the Effectiveness of Web Search Engines on Results Diversification Evaluation of Search Engines Using Advanced Search: Comparative Analysis of Yahoo and Bing Evaluating the Effectiveness of Google, Parsijoo, Rismoon, and Yooz to Retrieve Persian Documents Google Scholar to Overshadow Them All? Comparing the Sizes of 12 Academic Search Engines and Bibliographic Databases Dimensions: Building Context for Search and Evaluation Will Web Search Engines Replace Bibliographic Databases in the Systematic Identification of Research? Anatomy and Evolution of Database Search Engines -A Central Component of Mass Spectrometry Based Proteomic Workflows Deep Job Understanding at LinkedIn Study of the Usability of LinkedIn: A Social Media Platform Meant to Connect Employers and Employees Patent Prior Art Search Using Deep Learning Language Model Prior Art Search Using Multi-modal Embedding of Patent Documents Patent Retrieval: A Literature Review Academic Social Networks: Modeling, Analysis, Mining and Applications Global Social Networking Sites and Global Identity: A Three-Country Study An Experiment in Hiring Discrimination via Online Social Networks The Case for Voter-Centered Audits of Search Engines During Political Elections Search Bias Quantification: Investigating Political Bias in Social Media and Web Search Beyond the Bubble: Assessing the Diversity of Political Search Results Google Search Survey: How Much Do Users Trust Their Search Results? MOZ Lectures, Textbooks, Academic Calendar, and Administration: An Agenda for Change Teaching and Learning Without a Textbook: Undergraduate Student Perceptions of Open Educational Resources Growth Rates of Modern Science: A Bibliometric Analysis Based on the Number of Publications and Cited References Journal Impact Factor: A Bumpy Ride in an Open Space Building Journal Impact Factor Quartile into the Assessment of Academic Performance: A Case Study Should Highly Cited Items be Excluded in Impact Factor Calculation? The Effect of Review Articles on Top Most Research Tools For Selecting The Best Journal For Your Research Article. Pubrica Rise of the Rxivs: How Preprint Servers are Changing the Publishing Process Preprints and Preprint Servers as Academic Communication Tools. Revista Cuba. de Información en Ciencias de la Salud ArXiv at 20 The Relationship Between bioRxiv Preprints The Impact of Preprints in Library and Information Science: An Analysis of Citations, Usage and Social Attention Indicators Open Access to Scholarly Communications: Advantages, Policy and Advocacy. 
Representation learning: a review and new perspectives Low-dose x-ray tomography through a deep convolutional neural network Deep convolutional denoising of low-light images Image restoration using convolutional auto-encoders with symmetric skip connections Beyond a Gaussian denoiser: residual learning of deep cnn for image denoising A deep convolutional neural network to analyze position averaged convergent beam electron diffraction patterns Deep neural networks segment neuronal membranes in electron microscopy images A deep convolutional neural network approach to single-particle recognition in cryo-electron microscopy A guide to convolution arithmetic for deep learning Discrete fourier transform computation using neural networks Deep learning in neural networks: an overview Cudnn: efficient primitives for deep learning A review of convolutional neural networks for inverse problems in imaging A survey of deep neural network architectures and their applications Revisiting distributed synchronous sgd Tensorflow: a system for large-scale machine learning Low dose ct image denoising using a generative adversarial network with wasserstein distance and perceptual loss Rethinking atrous convolution for semantic image segmentation Encoder-decoder with atrous separable convolution for semantic image segmentation Improving electron micrograph signal-to-noise with an atrous convolutional encoder-decoder Xception: deep learning with depthwise separable convolutions Scientific charge-coupled devices Adam: a method for stochastic optimization Comparison of optimal performance at 300kev of three direct electron detectors for use in low dose electron microscopy Robust estimation of a location parameter Convolutional deep belief networks on cifar-10 Understanding the difficulty of training deep feedforward neural networks Regularization for deep learning: a taxonomy Weight normalization: A simple reparameterization to accelerate training of deep neural networks Batch normalization: accelerating deep network training by reducing internal covariate shift Bandwidth selection in kernel density estimation: a review Bandwidth selection for kernel conditional density estimation Multivariate Density Estimation: Theory, Practice, and Visualization The opencv library, Dr. Dobb's Bilateral filtering for gray and color images, Computer Vision SciPy: Open source scientific tools for Python Scikit-image: image processing in python Adaptive wavelet thresholding for image denoising and compression Ideal spatial adaptation by wavelet shrinkage An algorithm for total variation minimization and applications The split bregman method for l1-regularized problems Rudin-osher-fatemi total variation denoising using split bregman How does batch normalization help optimization? 
(no, it is not about internal covariate shift Contrast limited adaptive histogram equalization Image quality assessment: from error visibility to structural similarity Deconvolution and checkerboard artifacts Super-resolution using convolutional neural networks without any checkerboard artifacts Exit Wavefunction Reconstruction from Single Transmission Electron Micrographs with Deep Learning Supplementary Information: Exit Wavefunction Reconstruction from Single Transmission Electron Micrographs with Deep Learning Exit Wavefunction Reconstruction from Single Transmission Electron Micrographs with Deep Learning Tutorial on off-axis electron holography Youngs double-slit interference experiment with electrons An experiment on electron wave-particle duality including a Planck constant measurement Reconstruction of the projected crystal potential in transmission electron microscopy by means of a maximum-likelihood refinement algorithm Measuring the mean inner potential of Al 2 O 3 sapphire using off-axis electron holography Applications of electron holography Correction of aberrations of an electron microscope by means of electron holography Absolute measurement of normalized thickness, t/λ i , from off-axis electron holography Observation of the magnetic flux and three-dimensional structure of skyrmion lattices by electron holography Off-axis electron holography of magnetic nanowires and chains, rings, and planar arrays of magnetic nanoparticles Direct electron detectors Detective quantum efficiency of electron area detectors in electron microscopy Transmission electron microscopy: Diffraction, imaging, and spectrometry Scanning transmission electron microscopy: imaging and analysis Scanning electron microscopy and X-ray microanalysis On Abbe's theory of image formation in the microscope Fundamentals of focal series inline electron holography Off-axis and inline electron holography: A quantitative comparison Towards full-resolution inline electron holography Recording low and high spatial frequencies in exit wave reconstructions Hybridization approach to in-line and off-axis (electron) holography for superior resolution and phase sensitivity Quantitative characterization of electron detectors for transmission electron microscopy Phase recovery and holographic image reconstruction using deep learning in neural networks Direct exit-wave reconstruction from a single defocused image Direct retrieval of a complex wave from its diffraction pattern Warwick electron microscopy datasets Advances in computational methods for transmission electron microscopy simulation and image processing The crystallographic information file (CIF): a new standard archive file for crystallography Using SMILES strings for the description of chemical connectivity in the Crystallography Open Database COD::CIF::Parser: an error-correcting CIF parser for the Perl language Computing stoichiometric molecular composition from crystal structures Crystallography Open Database (COD): an open-access collection of crystal structures and platform for world-wide collaboration Crystallography Open Database -an open-access collection of crystal structures The American Mineralogist crystal structure database Advanced computing in electron microscopy OpenCL: A parallel programming standard for heterogeneous computing systems The FFT on a GPU Handbook of mathematical functions Cooling of melts: Kinetic stabilization and polymorphic transitions in the KInSnSe 4 system Convolutional neural networks for inverse problems in imaging: A review 
ImageNet classification with deep convolutional neural networks Batch normalization: Accelerating deep network training by reducing internal covariate shift Deep residual learning for image recognition Rectified linear units improve restricted Boltzmann machines Rectifier nonlinearities improve neural network acoustic models Dying ReLU and initialization: Theory and numerical examples Why ReLU units sometimes die: Analysis of single-unit error backpropagation in neural networks Empirical evaluation of rectified activations in convolutional network Encoder-decoder with atrous separable convolution for semantic image segmentation ADAM: A method for stochastic optimization An overview of gradient descent optimization algorithms Stochastic gradient descent optimizes over-parameterized deep relu networks Adaptive learning rate clipping stabilizes learning The step decay schedule: A near optimal, geometrically decaying learning rate procedure Advances in neural information processing systems Generative adversarial networks: A survey and taxonomy Towards a deeper understanding of adversarial losses Spectral normalization for generative adversarial networks Conditional generative adversarial nets Deep generative image models using a Laplacian pyramid of adversarial networks Generative adversarial text to image synthesis Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks Invertible conditional GANs for image editing Temporal generative adversarial nets with singular value clipping Adversarially learned inference Semi-supervised conditional gans cgans with projection discriminator Generative adversarial network training is a continual learning problem Accuracy of an automatic diffractometer. Measurement of the sodium chloride structure factors LaBeB 3 O 7 : A new phase-matchable nonlinear optical crystal exclusively containing the tetrahedral XO 4 (X=B and Be) anionic groups Relation between the crystal structure, physical properties and ferroelectric properties of PbZr x Ti 1−x O 3 (x=0.40, 0.45, 0.53) ferroelectric material by heat treatment PZT ceramics fabricated based on stereolithography for an ultrasound transducer array application metal-ferroelectric-metal capacitor and its application for IR sensor CdTe solar cells with open-circuit voltage breaking the 1 V barrier AdaNet: A scalable and flexible framework for automatically learning ensembles Ludwig: a type-based declarative deep learning toolbox AutoML: A survey of the state-of-the-art One shot exit wavefunction reconstruction Tensorflow: A system for large-scale machine learning Improving electron micrograph signal-to-noise with an atrous convolutional encoder-decoder Characterization and dynamic manipulation of graphene by in situ transmission electron microscopy at atomic scale, Handbook of Graphene: Physics Electron-driven in situ transmission electron microscopy of 2D transition metal dichalcogenides and their 2D heterostructures Atomic-resolution transmission electron microscopy of electron beam-sensitive crystalline materials Application of conventional electron microscopy in aquatic animal disease diagnosis: A review Transmission electron microscopy of cellulose. 
Part 2: technical and practical aspects Learning visual reasoning without strong priors Gatan microscopy suite, online: www.gatan.com/pro ducts/tem-analysis/gatan-microscopy-suitesoftware Unsupervised word embeddings capture latent knowledge from materials science literature QuCumber: Wavefunction reconstruction with neural networks NetKet: A machine learning toolkit for many-body quantum Advances in neural information processing systems Large scale GAN training for high fidelity natural image synthesis Fig. S1: Input amplitudes, target phases and output phases of 224×224 multiple material training set wavefunctions for unseen flips, rotations and translations Review: Deep Learning in Electron Microscopy Warwick Electron Microscopy Datasets Adaptive Learning Rate Clipping Stabilizes Learning Adaptive Partial Scanning Transmission Electron Microscopy with Reinforcement Learning Improving Electron Micrograph Signal-to-Noise with an Atrous Convolutional Encoder-Decoder Exit Wavefunction Reconstruction from Single Transmission Electron Micrographs with Deep Learning Resume of Jeffrey Mark Ede Supplementary Information: Warwick Electron Microscopy Datasets. Zenodo Supplementary Information: Partial Scanning Transmission Electron Microscopy with Deep Learning Supplementary Information: Adaptive Partial Scanning Transmission Electron Microscopy with Reinforcement Learning Supplementary Information: Exit Wavefunction Reconstruction from Single Transmission Electron Micrographs with Deep Learning Source Code for Warwick Electron Microscopy Datasets Warwick Electron Microscopy Datasets Archive Adaptive Learning Rate Clipping Stabilizes Learning Source Code for Adaptive Learning Rate Clipping Stabilizes Learning Partial Scanning Transmission Electron Microscopy with Deep Learning Deep Learning Supersampled Scanning Transmission Electron Microscopy Source Code for Partial Scanning Transmission Electron Microscopy Source Code for Deep Learning Supersampled Scanning Transmission Electron Microscopy Source Code for Adaptive Partial Scanning Transmission Electron Microscopy with Reinforcement Learning Improving Electron Micrograph Signal-to-Noise with an Atrous Convolutional Encoder-Decoder Source Code for Improving Electron Micrograph Signal-to-Noise with an Atrous Convolutional Encoder-Decoder Source Code for Exit Wavefunction Reconstruction from Single Transmission Electron Micrographs with Deep Learning Progress Reports of Jeffrey Mark Ede: 0.5 Year Progress Report Source Code for Beanland Atlas Thesis Word Counting. Zenodo Autoencoders, Kernels, and Multilayer Perceptrons for Electron Micrograph Restoration and Compression Source Code for Autoencoders, Kernels, and Multilayer Perceptrons for Electron Micrograph Restoration and Compression Source Code for Simple Webserver Advances in Electron Microscopy with Deep Learning Application of Novel Computing and Data Analysis Methods in Electron Microscopy. 
UK Research and Innovation Online: https: //gow.epsrc.ukri.org/NGBOViewGrant Structure Refinement from 'Digital' Large Angle Convergent Beam Electron Diffraction Patterns Beanland Atlas Repository Electron-Beam-Induced Ferroelectric Domain Behavior in the Transmission Electron Microscope: Toward Deterministic Domain Patterning Neural Network Generative Art in Javascript /19/neural-network-generative-art Generate Abstract Random Art with A Neural Network Image Style Transfer Using Convolutional Neural Networks A Neural Algorithm of Artistic Style Perceptual Losses for Real-Time Style Transfer and Super-Resolution High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation An End-to-End Compression Framework Based on Convolutional Neural Networks Connecting the Dots: Writing a Doctoral Thesis by Publication Institutional and Supervisory Support for the Thesis by Publication The Declining Scientific Impact of Theses: Implications for Electronic Thesis and Dissertation Repositories and Graduate Studies The State of the Art in Peer Review Emerging Trends in Peer Review -A Survey Peer Reviewers Unmasked: Largest Global Survey Reveals Trends On Performance of Peer Review for Academic Journals: Analysis Based on Distributed Parallel System Providing Open Access to PhD Theses: Visibility and Citation Benefits. Program Ways of Disseminating, Tracking Usage and Impact of Electronic Theses and Dissertations (ETDs) The Making of Knowledge-Makers in Composition: A Distant Reading of Dissertations ArXiv at 20 Introduction to LATEX and to Some of its Tools. ArsTEXnica Pimp Your Thesis: A Minimal Introduction to LATEX LATEX: A Document Preparation System: User's Guide and Reference Manual Rise of the Rxivs: How Preprint Servers are Changing the Publishing Process Praise of Preprints Preprints and Preprint Servers as Academic Communication Tools. Revista Cubana de Información en Ciencias de la Salud Answers to 18 Questions About Open Science Practices The Relationship Between bioRxiv Preprints, Citations and Altmetrics The Impact of Preprints in Library and Information Science: An Analysis of Citations, Usage and Social Attention Indicators Open Access to Scholarly Communications: Advantages, Policy and Advocacy. Acceso Abierto a la información en las Bibliotecas Académicas de América Latina y el Caribe Meta-Research: Releasing a Preprint is Associated with More Attention and Citations for the Peer-Reviewed Article Open Access Meets Discoverability: Citations to Articles Posted to Academia Comparing Published Scientific Journal Articles to Their Pre-Print Versions Comparing Quality of Reporting Between Preprints and Peer-Reviewed Articles in the Biomedical Literature Understanding the Importance of Copyediting in Peer-Reviewed Manuscripts Document management -Portable document format -Part 2: PDF 2.0. International Organization for Standardization ISO 32000-2:2008 Document management -Portable document format -Part 1: PDF 1.7. Adobe Systems arXiv License Information On the Use of ArXiv as a Dataset What is Open Peer Review? A Systematic Review Open Scholarship and Peer Review: a Time for Experimentation A Conceptual Peer Review Model for arXiv and Other Preprint Databases A Brief History of Physics Reviews A Review of Deep Learning with Special Emphasis on Architectures, Applications and Recent Trends. 
Knowledge-Based Systems Review of Deep Learning Algorithms and Architectures A State-of-the-Art Survey on Deep Learning Theory and Architectures Deep Learning Deep Learning in Neural Networks: An Overview On the Use of Deep Learning for Computational Imaging Deep Learning Analysis on Microscopic Imaging in Materials Science. Materials Today Nano From DFT to Machine Learning: Recent Approaches to Materials Science -A Review An Introduction to Variational Autoencoders Tutorial on Variational Autoencoders Visualizing Data Using t-SNE Clustering with t-SNE Accelerating t-SNE Using Tree-Based Algorithms Reproducibility Crisis Translated by Peng Ping RFC1738: Uniform Resource Locators (URL). RFC Fibre Optic Communication in 21 st Century The History of Broadband Ultra-Fast Broadband Investment and Adoption: A Survey Connecting the Unconnected': A Critical Assessment of US Performance Evaluation of Internet over Geostationary Satellite for Industrial Applications Google Unveils Search Engine for Open Data Discovering Millions of Datasets on the Web. The Keyword Revisiting Unreasonable Effectiveness of Data in Deep Learning Era Machine Learning and Big Scientific Data The Kinetics Human Action Video Dataset ImageNet Large Scale Visual Recognition Challenge YouTube-8M: A Large-Scale Video Classification Benchmark World Health Organization Declares Global Emergency: A Review of the 2019 Novel Coronavirus (COVID-19) Pathophysiology, Transmission, Diagnosis, and Treatment of Coronavirus Disease 2019 (COVID-19): A Review A Review of Coronavirus Disease-2019 (COVID-19) Impact of Lung Segmentation on the Diagnosis and Explanation of COVID-19 in Chest X-Ray Images A Short Review on Different Clustering Techniques and Their Applications Clustering Algorithms: A Comparative Approach A Review of Clustering Algorithms for Big Data Clustering Approaches for High-Dimensional Databases: A Review Spatiotemporal Clustering: A Review Data Clustering: A Review Nonlinear Dimensionality Reduction Stochastic Neighbor Embedding Application of t-SNE to Human Genetic Data The Protein-Small-Molecule Database, A Non-Redundant Structural Resource for the Analysis of Protein-Ligand Binding Dimensionality Reduction and Visualisation of Hyperspectral Ink Data Using t-SNE Unsupervised Clustering of Hyperspectral Paper Data Using t-SNE Dimensionality Reduction in Deep Learning for Chest X-Ray Analysis of Lung Cancer Nonlinear Dimension Reduction for EEG-Based Epileptic Seizure Detection Data-Driven Identification of Prognostic Tumor Subpopulations Using Spatially Mapped t-SNE of Mass Spectrometry Imaging Data Context-Enriched Identification of Particles with a Convolutional Network for Neutrino Events Revealing Fundamental Physics from the Daya Bay Neutrino Experiment Using Deep Neural Networks Visual Clustering Analysis of Electricity Data Based on t-SNE The Infinite Drum Machine Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection Principal Component Analysis: A Review and Recent Developments Singular Value Decomposition and Principal Component Analysis Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions A Randomized Algorithm for the Decomposition of Matrices GPGPU Linear Complexity t-SNE Optimization Canny. 
t-SNE-CUDA: GPU-Accelerated t-SNE and its Applications to Modern Data Approximated and User Steerable tSNE for Progressive Visual Analytics Automated Optimized Parameters for t-Distributed Stochastic Neighbor Embedding Improve Visualization and Analysis of Large Datasets A Trivariate Clough-Tocher Scheme for Tetrahedral Data Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks Conditional WGANs with Adaptive Gradient Balancing for Sparse MRI Reconstruction Estimates of Edge Detection Filters in Human Vision Autoencoding Beyond Pixels Using a Learned Similarity Metric Improving Image Autoencoder Embeddings with Perceptual Loss Image Quality Assessment: From Error Visibility to Structural Similarity Loss Functions for Image Restoration with Neural Networks Pairwise Supervised Hashing with Bernoulli Variational Auto-Encoder and Self-Control Gradient Estimator Semantic Hashing with Variational Autoencoders Deep Hashing Based on VAE-GAN for Efficient Similarity Retrieval A Binary Variational Autoencoder for Hashing Toward End-to-End Neural Architecture for Generative Semantic Hashing Variational Deep Semantic Hashing for Text Documents Generative Adversarial Network Training is a Continual Learning Problem Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks Bounding the Expected Run-Time of Nonconvex Optimization with Early Stopping Using Pre-Training can Improve Model Robustness and Uncertainty An Alternative Probabilistic Interpretation of the Huber Loss Robust Estimation of a Location Parameter AutoClip: Adaptive Gradient Clipping for Source Separation Networks On the Difficulty of Training Recurrent Neural Networks Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping On the Variance of the Adaptive Learning Rate and Beyond ADAM: A Method for Stochastic Optimization Pixel Subset Super-Compression with a Generative Adversarial Network (Unfinished Manuscript) Atomic Scale Deep Learning Pixel Subset Super-Compression of STEM Images Dynamic Scan Control in STEM: Spiral Scans. Advanced Structural and Chemical Imaging Precision Controlled Atomic Resolution Scanning Transmission Electron Microscopy Using Spiral Scan Pathways Survey on FPGA Architecture and Recent Applications Joint Denoising and Distortion Correction for Atomic Column Detection in Scanning Transmission Electron Microscopy Images Correction of Image Drift and Distortion in a Scanning Electron Microscopy COVID-19 Timeline /04/covid-19-timeline Deeply Uncertain: Comparing Methods of Uncertainty Quantification in Deep Learning Algorithms Uncertainty Quantification in Deep Learning: Literature Survey Evaluation of Uncertainty Quantification in Deep Learning A General Framework for Uncertainty Estimation in Deep Learning Geometry and Uncertainty in Deep Learning for Computer Vision Uncertainty in Deep Learning Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and use Interpretable Models Instead Simple and Scalable Epistemic Uncertainty Estimation Using a Single Deep Deterministic Neural Network Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles A Simple Baseline for Bayesian Uncertainty in Deep Learning Bayesian Uncertainty Estimation for Batch Normalized Deep Networks What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? 
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning Bagging Predictors Learning Implicit Generative Models by Matching Perceptual Features Asymptotic Optimality of Finite Model Approximations for Partially Observed Markov Decision Processes With Discounted Cost Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems Memory-Based Control with Recurrent Neural Networks The Need for Reporting Negative Results -A 90 Year Update Dealing with the Positive Publication Bias: Why You Should Really Publish Your Negative Results Publication Bias and the Canonization of False Facts. eLife Identification of and Correction for Publication Bias The Dark Side of Radiomics: On the Paramount Importance of Publishing Negative Results Is Positive Publication Bias Really a Bias, or an Intentionally Created Discrimination Toward Negative Results? Negativity Towards Negative Results: A Discussion of the Disconnect Between Scientific Worth and Scientific Culture Attention is All You Need The Illustrated Transformer A Comparative Study on Transformer vs RNN in Speech Applications A Comparison of Transformer and LSTM Encoder Decoder Models for ASR Image Denoising Review: From Classical to State-of-the-Art Approaches Image Denoising: Issues and Challenges Brief Review of Image Denoising Techniques. Visual Computing for Industry Image Denoising by Sparse 3-D Transform-Domain Collaborative Filtering An Analysis and Implementation of the BM3D Image Denoising Method Xception: Deep Learning with Depthwise Separable Convolutions ImageNet Classification with Deep Convolutional Neural Networks Universal Approximation with Deep Narrow Networks ResNet with One-Neuron Hidden Layers is a Universal Approximator Approximating Continuous Functions by ReLU Nets of Minimal Width The Expressive Power of Neural Networks: A View from the Width Approximation Theory of the MLP Model in Neural Networks Multilayer Feedforward Networks with a Nonpolynomial Activation Function can Approximate any Function Approximation Capabilities of Multilayer Feedforward Networks Multilayer Feedforward Networks are Universal Approximators Approximation by Superpositions of a Sigmoidal Function Convolutional Deep Belief Networks on CIFAR-10 Rectified Linear Units Improve Restricted Boltzmann Machines Deep Sparse Rectifier Neural Networks AutoML: A Survey of the State-of-the-Art Modeling Neural Architecture Search Methods for Deep Networks Reinforcement Learning for Neural Architecture Search: A Review. Image and Vision Computing Neural Architecture Search: A Survey Automated Machine Learning: Review of the State-of-the-Art and Opportunities for Healthcare Can AutoML Outperform Humans? An Evaluation on Popular OpenML Datasets Using AutoML Benchmark Learning Transferable Architectures for Scalable Image Recognition Advances in Computational Methods for Transmission Electron Microscopy Simulation and Image Processing Tutorial on Off-Axis Electron Holography. 
Microscopy and Microanalysis Off-Axis and Inline Electron Holography: A Quantitative Comparison Hybridization Approach to In-Line and Off-Axis (Electron) Holography for Superior Resolution and Phase Sensitivity Robust Perceptual Night Vision in Thermal Colorization Learning to See in the Dark Phase Recovery and Holographic Image Reconstruction Using Deep Learning in Neural Networks Extended Depth-of-Field in Holographic Imaging Using Deep-Learning-Based AutofocUsing and Phase Recovery Optics of High-Performance Electron Microscopes Optical and Digital Microscopic Imaging Techniques and Applications in Pathology Direct Exit-Wave Reconstruction from a Single Defocused Image Direct Retrieval of a Complex Wave from its Diffraction Pattern Atomic-Resolution Cryo-STEM Across Continuously Variable Temperature The Impact of STEM Aberration Correction on Materials Science Twenty Years After: How "Aberration Correction in the STEM" Truly Placed a "A Synchrotron in a Microscope Aberration Correction Past and Present Automated Labeling of Electron Microscopy Images Using Deep Learning Artificial Intelligence Faces Reproducibility Crisis TensorFlow: A System for Large-Scale Machine Learning Large-Scale Machine Learning on Heterogeneous Distributed Systems Online: www.gatan.com/products/tem-analysis/gatanmicroscopy-suite-software Real-Time Data Processing Using Python in DigitalMicrograph Future Progress in Artificial Intelligence: A Survey of Expert Opinion Towards Game Design via Creative Machine Learning (GDCML) Co-Creative Level Design via Machine Learning DLPaper2Code: Auto-Generation of Code from Deep Learning Research Papers Large-Scale Machine Learning Systems in Real-World Industrial Settings: A Review of Challenges and Solutions Scalable Machine-Learning Algorithms for Big Data Analytics: A Comprehensive Review The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction Hidden Technical Debt in Machine Learning Systems Self-Organization, Natural Selection, and Evolution: Cellular Hardware and Genetic Software Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks Spectral Normalization for Generative Adversarial Networks Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift Resume of Jeffrey Mark Ede English Salary: Best Offer Willing to Relocate: Yes Remote Work: Prefer On-Site LINKS arXiv I am about to submit my finished doctoral thesis and want to arrange a job as soon as possible. My start date is flexible. I have four years of programming experience and a background in physics, machine learning, and automation. EXPERIENCE Researcher -Machine Learning / Electron Microscopy From Oct 2017 at the University of Warwick My doctoral thesis titled • Generative adversarial networks for quantum mechanics and compressed sensing. 
• Signal denoising for low electron dose imaging
• Curation, management, and processing of large new machine learning datasets

Review: Deep Learning in Electron Microscopy
Warwick Electron Microscopy Datasets
Partial Scanning Transmission Electron Microscopy with Deep Learning
Adaptive Learning Rate Clipping Stabilizes Learning
Improving Electron Micrograph Signal-to-Noise with an Atrous Convolutional Encoder-Decoder
Exit Wavefunction Reconstruction from Single Transmission Electron Micrographs with Deep Learning
Adaptive Partial Scanning Transmission Electron Microscopy with Reinforcement Learning
Deep Learning Supersampled Scanning Transmission Electron Microscopy

Thanks go to Jeremy Sloan and Martin Lotz for internally reviewing this article. In addition, part of the text in section 1.2 is adapted from our earlier work with permission [201] under a Creative Commons Attribution 4.0 [73] license. Finally, the author acknowledges funding from EPSRC grant EP/N035437/1 and EPSRC Studentship 1917382. The author declares no competing interests.

Thanks go to Jasmine Clayton, Abdul Mohammed, and Jeremy Sloan for internal review. The author acknowledges funding from EPSRC grant EP/N035437/1 and EPSRC Studentship 1917382. J.M.E. proposed this research, wrote the code, collated training data, performed experiments and analysis, created repositories, and co-wrote this paper. R.B. supervised and co-wrote this paper. The authors declare no competing interests.

Supplementary information is available for this paper at https://doi.org/10.1038/s41598-020-65261-0. Correspondence and requests for materials should be addressed to J.M.E. Reprints and permissions information is available at www.nature.com/reprints. Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

A detailed schematic of our neural network architecture is shown in Fig. 6. The components in our network are:

Avg Pool w×w, Stride x: Average pooling is applied by calculating mean values for squares of width w that are spatially separated by x elements.

Bilinear Upsamp ×m: This is an extension of linear interpolation in one dimension to two dimensions. It is used to upsample images by a factor of m.

Clip [a,b]: Clip the input tensor's values so that they are in a specified range. If values are less than a, they are set to a; if values are more than b, they are set to b.

Concat, d: Concatenation of two tensors with the same spatial dimensions to a new tensor with the same spatial dimensions and both their feature spaces. The size of the new feature depth, d, is the sum of the feature depths of the tensors being concatenated.
Conv d, w×w, Stride x: Convolution with a square kernel of width w that outputs d feature layers. If a stride is specified, convolutions are only applied to every xth spatial element of their input, rather than to every element. Striding is not applied depthwise.

Sep Conv d, w×w, Stride x, Rate r: Depthwise separable convolutions consist of depthwise convolutions that act on each feature layer, followed by pointwise convolutions. The separation of the convolution into two parts allows it to be implemented more efficiently on most modern GPUs. The arguments specify a square kernel of width w that outputs d feature layers. If a stride is specified, convolutions are only applied to every xth spatial element of their input, rather than to every element. Strided convolutions are used so that networks can learn their own downsampling and are not applied depthwise. If an atrous rate, r, is specified, kernel elements are spatially spread out by an extra r − 1 elements, rather than being next to each other.

Trans Conv d, w×w, Stride x: Transposed convolutions, sometimes called deconvolutions after [65], allow the network to learn its own upsampling. They can be thought of as adding x − 1 zeros between spatial elements, then applying a convolution with a square kernel of width w that outputs d feature maps.

Circled plus signs indicate residual connections where incoming tensors are added together. These help reduce signal attenuation and allow the network to learn identity mappings more easily. All convolutions are followed by batch normalization then ReLU6 activation. Extra batch normalization is added between the depthwise and pointwise convolutions of depthwise separable convolutions. Weights were Xavier uniform initialized; biases were zero-initialized. A minimal code sketch of these building blocks is given at the end of this document.

Half of wavefunction information is undetected by conventional transmission electron microscopy (CTEM) as only the intensity, and not the phase, of an image is recorded. Following successful applications of deep learning to optical hologram phase recovery, we have developed neural networks to recover phases from CTEM intensities for new datasets containing 98340 exit wavefunctions. Wavefunctions were simulated with clTEM multislice propagation for 12789 materials from the Crystallography Open Database. Our networks can recover 224×224 wavefunctions in ∼25 ms for a large range of physical hyperparameters and materials, and we demonstrate that performance improves as the distribution of wavefunctions is restricted. Phase recovery with deep learning overcomes the limitations of traditional methods: it is live, not susceptible to distortions, does not require microscope modification or multiple images, and can be applied to any imaging regime. This paper introduces multiple approaches to CTEM phase recovery with deep learning, and is intended to establish starting points to be improved upon by future research. Source code and links to our new datasets and pre-trained models are available at https://github.com/Jeffrey-Ede/one-shot. Keywords: deep learning, electron microscopy, exit wavefunction reconstruction

Information transfer by electron microscope lenses and correctors can be described by wave optics [1] as electrons exhibit wave-particle duality [2, 3]. In a model electron microscope, a system of condenser lenses directs electrons illuminating a material into a planar wavefunction, ψ_inc(r, z), with wavevector, k.
Here, z is the distance along the optical axis in the electron propagation direction, described by unit vector ẑ, and r is the position in a plane perpendicular to the optical axis. As ψ_inc(r, z) travels through a material in fig. 1a, it is perturbed to an exit wavefunction, ψ_exit(r, z), by a material potential. The projected potential of a material in direction ẑ, U(r, z), and corresponding structural information can be calculated from ψ_exit(r, z) [4, 5].

Thanks go to Christoph T. Koch for software used to collect experimental focal series, to David Walker for suggesting materials in fig. 7, and to Jessica Marshall for feedback on fig. 7. Funding: J.M.E. acknowledges EPSRC grant EP/N035437/1 and EPSRC Studentship 1917382 for financial support, R.B. acknowledges EPSRC grant EP/N035437/1 for financial support, J.J.P.P. acknowledges EPSRC grant EP/P031544/1 for financial support, and J.S. acknowledges EPSRC grant EP/R019428/1 for financial support.

Collecting experimental CTEM holograms with a biprism or focal series reconstruction is expensive: measuring a large number of representative holograms is time-intensive, and requires skilled electron microscopists to align and operate microscopes. In this context, we propose a new method to reconstruct holograms by extracting information from a large image database with deep learning. It is based on the idea that individual images are fragments of aberration series sampled from an aberration series distribution. To be clear, this section summarizes an idea and is intended to be a starting point for future work. Let ψ_exit ∼ Ψ_exit denote an unknown exit wavefunction, ψ_exit, sampled from a distribution, Ψ_exit; c ∼ C denote an unknown contrast transfer function (CTF), c = ψ_pert(q)/ψ_dif(q), sampled from a distribution, C; and m ∼ M denote metadata, m, sampled from a distribution, M, that restricts Ψ_exit. The image wave is ψ_img = FT^{-1}(c FT(ψ_exit)). We propose introducing a faux CTF, c′ ∼ C′, to train a cycle-consistent generator, G, and discriminator, D, to predict the exit wave. The faux CTF can be used to generate an image wavefunction ψ′_img = FT^{-1}(c′ FT(ψ_exit)), as illustrated by the code sketch below. If the faux distribution is realistic, D can be trained to discriminate between |ψ_img| and |ψ′_img|. For example, by minimizing an expected adversarial loss where m′ ≠ m if metadata describe different CTFs. A cycle-consistent adversarial generator can then be trained to minimize an expected weighted sum of adversarial and cycle-consistency losses, where λ weights their relative contributions. The adversarial loss trains the generator to produce realistic wavefunctions, whereas the cycle-consistency loss trains the generator to learn unique solutions. Alternatively, CTFs could be preserved by an additional mapping when calculating the L2 norm in eqn. 31. If CTFs are preserved by this mapping, c′ is a relative, rather than absolute, CTF, and c′c is the CTF of ψ′_img. Two of our experimental datasets containing 17267 TEM and 16227 STEM images are available with our new wavefunction datasets [26]. However, the images are unlabelled to anonymise contributors, limiting the metadata available to restrict a distribution of wavefunctions. As a potential starting point for experimental one-shot exit wavefunction reconstruction, we have made 1000 focal series publicly available [26]. We have also made simple focal series reconstruction code available at [79]. Alternatively, refined focal and tilt series reconstruction (FTSR) software is commercially available [95]. Each series consists of 14 32-bit 512×512 TIFFs, area downsampled from 4096×4096 with MATLAB [96] and default antialiasing. All series were created with a common, quadratically increasing [20] defocus series. However, spatial scales vary and must be fitted as part of reconstruction.
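To make the image-wave relation above concrete, the following is a minimal NumPy sketch of applying a CTF to an exit wavefunction via fast Fourier transforms, ψ_img = FT^{-1}(c FT(ψ_exit)). It is illustrative only and is not the simulation code used for our datasets, which were generated by clTEM multislice propagation: the function name, the quadratic defocus plus spherical aberration model, and the example parameter values are assumptions, and sign conventions, envelopes and apertures are omitted.

```python
import numpy as np

def image_wavefunction(exit_wave, defocus, wavelength, px_size, Cs=0.0):
    """Apply a simple phase-contrast transfer function in reciprocal space:
    psi_img = IFFT( c(q) * FFT(psi_exit) ).

    Only defocus and spherical aberration (Cs) are modelled; all quantities
    are in consistent length units (e.g. metres).
    """
    ny, nx = exit_wave.shape
    qy = np.fft.fftfreq(ny, d=px_size)  # spatial frequencies, 1/length
    qx = np.fft.fftfreq(nx, d=px_size)
    q2 = qx[np.newaxis, :]**2 + qy[:, np.newaxis]**2  # squared frequency

    # Wave aberration function for defocus and spherical aberration only.
    chi = np.pi * wavelength * defocus * q2 \
        + 0.5 * np.pi * Cs * wavelength**3 * q2**2
    ctf = np.exp(-1j * chi)  # pure phase CTF, no damping envelopes

    return np.fft.ifft2(ctf * np.fft.fft2(exit_wave))

# A detector records only the intensity of the image wavefunction, so the
# phase is lost. Hypothetical values: ~1.97 pm wavelength (300 kV electrons),
# 0.1 nm pixels and 10 nm defocus, with a trivial placeholder exit wave.
psi_exit = np.ones((224, 224), dtype=np.complex64)
image = np.abs(image_wavefunction(psi_exit, 10e-9, 1.97e-12, 0.1e-9))**2
```

A faux CTF, c′, could be applied in the same way by drawing different aberration parameters for each training example.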
Example applications of ANNs are shown in figs. S1-S18, and source code for every ANN is available in [1]. Phases in [−π, π) rad are depicted on a linear greyscale from black to white. Wavefunctions are cyclically periodic functions of phase, so phase differences between black and white pixels are small.
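The depiction convention and the cyclic nature of phase can be sketched as follows; this is an illustrative example rather than our plotting code, and the function names are assumptions.

```python
import numpy as np

def phase_to_greyscale(wavefunction):
    """Map phases in (-pi, pi] rad linearly onto [0, 1] (black to white)."""
    return (np.angle(wavefunction) + np.pi) / (2 * np.pi)

def cyclic_phase_difference(phase_a, phase_b):
    """Smallest phase difference accounting for 2*pi periodicity, so nearly
    black (about -pi) and nearly white (just below pi) pixels are close."""
    delta = np.abs(phase_a - phase_b) % (2 * np.pi)
    return np.minimum(delta, 2 * np.pi - delta)
```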
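Finally, the layer components described in the architecture appendix above (convolutions followed by batch normalization and ReLU6 activation, depthwise separable convolutions with atrous rates, transposed convolutions for learned upsampling, and residual connections) are sketched below in TensorFlow. This is a minimal illustration, not the exact architecture of Fig. 6: the block names, the toy assembly, and details such as where strides are applied within separable convolutions follow common Keras practice and may differ from the original implementation.

```python
import tensorflow as tf

def conv_block(x, depth, width, stride=1):
    # Conv d, w x w, Stride x: convolution -> batch normalization -> ReLU6,
    # with Xavier (Glorot) uniform weights and zero-initialized biases.
    x = tf.keras.layers.Conv2D(depth, width, strides=stride, padding="same",
                               kernel_initializer="glorot_uniform",
                               bias_initializer="zeros")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU(max_value=6.0)(x)

def sep_conv_block(x, depth, width, stride=1, rate=1):
    # Sep Conv d, w x w, Stride x, Rate r: depthwise convolution (optionally
    # atrous), extra batch normalization, then a pointwise convolution.
    # Note: Keras does not allow a stride > 1 together with a dilation rate > 1.
    x = tf.keras.layers.DepthwiseConv2D(width, strides=stride, padding="same",
                                        dilation_rate=rate,
                                        depthwise_initializer="glorot_uniform")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Conv2D(depth, 1, kernel_initializer="glorot_uniform",
                               bias_initializer="zeros")(x)  # pointwise
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU(max_value=6.0)(x)

def trans_conv_block(x, depth, width, stride=2):
    # Trans Conv d, w x w, Stride x: learned upsampling by a transposed
    # convolution, roughly inserting stride - 1 zeros between elements.
    x = tf.keras.layers.Conv2DTranspose(depth, width, strides=stride,
                                        padding="same",
                                        kernel_initializer="glorot_uniform")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU(max_value=6.0)(x)

# Toy assembly with a residual connection (the circled plus in the schematic).
inputs = tf.keras.Input(shape=(96, 96, 1))
x = conv_block(inputs, 32, 3, stride=2)   # learned downsampling
y = sep_conv_block(x, 32, 3, rate=2)      # atrous separable convolution
x = tf.keras.layers.Add()([x, y])         # residual connection
x = trans_conv_block(x, 16, 3, stride=2)  # learned upsampling
model = tf.keras.Model(inputs, x)
```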