key: cord-0291659-odhphnwd
authors: Storer, Dara; Dillon, Joshua S.; Jacobs, Daniel C.; Morales, Miguel F.; Hazelton, Bryna J.; Ewall-Wice, Aaron; Abdurashidova, Zara; Aguirre, James E.; Alexander, Paul; Ali, Zaki S.; Balfour, Yanga; Beardsley, Adam P.; Bernardi, Gianni; Billings, Tashalee S.; Bowman, Judd D.; Bradley, Richard F.; Bull, Philip; Burba, Jacob; Carey, Steven; Carilli, Chris L.; Cheng, Carina; DeBoer, David R.; Acedo, Eloy de Lera; Dexter, Matt; Dynes, Scott; Ely, John; Fagnoni, Nicolas; Fritz, Randall; Furlanetto, Steven R.; Gale-Sides, Kingsley; Glendenning, Brian; Gorthi, Deepthi; Greig, Bradley; Grobbelaar, Jasper; Halday, Ziyaad; Hewitt, Jacqueline N.; Hickish, Jack; Huang, Tian; Josaitis, Alec; Julius, Austin; Kariseb, MacCalvin; Kern, Nicholas S.; Kerrigan, Joshua; Kittiwisit, Piyanat; Kohn, Saul A.; Kolopanis, Matthew; Lanman, Adam; Plante, Paul La; Liu, Adrian; Loots, Anita; MacMahon, David; Malan, Lourence; Malgas, Cresshim; Martinot, Zachary E.; Mesinger, Andrei; Molewa, Mathakane; Mosiane, Tshegofalang; Murray, Steven G.; Neben, Abraham R.; Nikolic, Bojan; Nunhokee, Chuneeta Devi; Parsons, Aaron R.; Pascua, Robert; Patra, Nipanjana; Pieterse, Samantha; Pober, Jonathan C.; Razavi-Ghods, Nima; Riley, Daniel; Robnett, James; Rosie, Kathryn; Santos, Mario G.; Sims, Peter; Singh, Saurabh; Smith, Craig; Tan, Jianrong; Thyagarajan, Nithyanandan; Williams, Peter K. G.; Zheng, Haoxuan
title: Automated Detection of Antenna Malfunctions in Large-N Interferometers: A Case Study with the Hydrogen Epoch of Reionization Array
date: 2021-09-27
journal: nan
DOI: 10.1029/2021rs007376
sha: 5c1be49c6b243f862d5cd2d3ea14faf23996e8ed
doc_id: 291659
cord_uid: odhphnwd

We present a framework for identifying and flagging malfunctioning antennas in large radio interferometers. We outline two distinct categories of metrics designed to detect outliers along known failure modes of large arrays: cross-correlation metrics, based on all antenna pairs, and auto-correlation metrics, based solely on individual antennas. We define and motivate the statistical framework for all metrics used, and present tailored visualizations that aid us in clearly identifying new and existing systematics. We implement these techniques using data from 105 antennas in the Hydrogen Epoch of Reionization Array (HERA) as a case study. Finally, we provide a detailed algorithm for implementing these metrics as flagging tools on real data sets.

Study of the Epoch of Reionization (EoR) through detection and observation of the 21 cm emission line from neutral hydrogen will provide critical insights into the formation of the earliest structures of the universe, and help inform understanding of the underlying physics behind galaxy formation and the intergalactic medium (Furlanetto et al., 2006; Morales & Wyithe, 2010; Pritchard & Loeb, 2012). There are currently several interferometric arrays working to detect the 21 cm signal, including the Precision Array for Probing the Epoch of Reionization (PAPER; Parsons et al. (2010)), the Giant Metrewave Radio Telescope (GMRT; Paciga et al. (2011)), the Murchison Widefield Array (MWA; Tingay et al. (2013)), the LOw Frequency ARray (LOFAR; van Haarlem et al. (2013)), the Canadian Hydrogen Intensity Mapping Experiment (CHIME; Newburgh et al. (2014)), the Hydrogen Epoch of Reionization Array (HERA; DeBoer et al. (2017)), and the Large-Aperture Experiment to Detect the Dark Age (LEDA; Price et al. (2018)).
In addition, there are exciting new experiments on the horizon, including the upcoming Square Kilometre Array (SKA; Mellema et al. (2013)) and the Hydrogen Intensity and Real-time Analysis eXperiment (HIRAX; Saliwanchik et al. (2021)). The 21 cm fluctuation signal is very faint; typical models forecast signal amplitudes in the tens of millikelvin, making the signal four to five orders of magnitude fainter than the bright radio foregrounds (Santos et al., 2005; Bernardi et al., 2010). Attempts to measure the power spectrum using radio interferometers must therefore be executed with high sensitivity and precision analysis techniques in order to realistically achieve a detection (Liu & Shaw, 2020). Achieving sufficient sensitivity requires an interferometer with a large number of antennas observing for months, which introduces a high level of complexity to the system. The need for high sensitivity and precision therefore results in thousands of interconnected subsystems that must be commissioned by a relatively small number of people, which poses a significant challenge. Additionally, due to the faintness of the signal, low-level systematics that might be deemed negligible in other astronomical applications have the potential to leak into the power spectrum and obscure the 21 cm signal. Systematics must therefore be resolved, methodically avoided, or directly removed in order to achieve sufficiently clean data. Some examples of contaminants common in these types of interferometers include adverse primary beam effects (Beardsley et al., 2016a; Fagnoni et al., 2020; Joseph et al., 2019; Chokshi et al., 2021), internal reflections (Beardsley et al., 2016b; Kern et al., 2019; Kern, Parsons, et al., 2020; Kern, Dillon, et al., 2020), radio frequency interference (RFI) (Wilensky et al., 2020; Whitler et al., 2019), and any analog or digital systematics resulting from the specific design and configuration of the array and its component electronics (Benkevitch et al., 2016; de Gasperin et al., 2019; Star, 2020). In this work we focus on systematics arising from a malfunction in an individual antenna, component, or subsystem, using data from HERA as a case study to implement and test our methods. While there are some systematics we can avoid using clever analysis techniques (see Kern, Parsons, et al. (2020) for example), we manage most systematics by directly removing the affected antennas from the raw data. This requires us to identify and flag any data exhibiting a known malfunction, and to develop methodologies for catching new or previously unidentified systematic effects. While the primary goal of flagging data is to produce the cleanest possible data for analysis, it has the added benefit of providing information on the scope and character of prevalent issues to the commissioning team, which is essential to our ultimate goal of finding and resolving the source of each problem. The purpose of this work is to outline a framework for identifying and flagging malfunctioning antennas. While manual inspection of all data would likely be an effective approach to antenna flagging, for large-N interferometers the data volume makes this approach impractical. For example, when completed, HERA will have 350 individual dishes, each with a dual-polarization signal chain including several analog and digital subcomponents.
Even just for the 105 elements included in the data used here, manual flagging would involve assessing 22,155 baselines, each of which has 1024 frequency bins and thousands of time integrations. The hands-on time involved is neither practical nor reproducible, and so an automated approach is preferred. In this paper we present an automated approach to antenna quality assessment and flagging. Our approach is to design a set of statistical metrics based on common failure modes of interferometric instruments. We also optimize the metrics to use a limited fraction of the data so they are usable in a real-time pipeline. We break these metrics into two categories: cross-correlation metrics (per-antenna values calculated using all baselines), and auto-correlation metrics (per-antenna values calculated using only the auto-correlations). For the duration of this paper we define cross-correlations as correlations between two different antennas, and auto-correlations as the correlation of an antenna with itself. These two methods have complementary advantages. The cross-correlation metrics require a larger data volume, but give us insight into the performance of the whole array and all component subsystems, whereas the auto-correlation metrics are optimized to use a small amount of data, and help assess the functionality of individual array components. We outline how each of our metrics is designed to catch one or more known failure modes in the smallest amount of data possible, and we validate that the automation procedure flags these failures effectively. We also use tools such as simulated noise and comparisons with manual flags to aid in validating our procedure. While these metrics were designed based on HERA data, it is important to note that both the approach and the metrics themselves are applicable to any large interferometric array. The HERA data used in this paper were collected on September 29, 2020 (JD 2459122) when there were 105 antennas online, shown graphically in Figure 1. Note that these data are from the second phase of the HERA array, which uses Vivaldi feeds rather than dipoles, along with other changes, and differs significantly from the phase one data analyzed in HERA Collaboration et al. (2021). The HERA receivers are distributed throughout the array in nodes, which contain modules for post-amplification, filtering, analog-to-digital conversion, and a frequency Fourier transform. Each node serves up to 12 antennas. Node clocks are synchronized by a White Rabbit timing network (Moreira et al., 2009). Figure 1 illustrates the node architecture overlain with the antenna cataloging developed in this paper. These flags were produced using almost ten hours of data from this night. The high fraction of malfunctioning antennas was partly attributable to limited site access due to the COVID-19 pandemic. HERA has no moving parts and performs a drift-scan observation of a ∼10° patch around zenith. The portion of the sky observed on JD 2459122 is shown overlaid on the radio sky in Figure 2. This paper is organized as follows. In Section 2 we outline the two cross-correlation metrics, providing details of their calculation and a demonstration of their utility. We also examine the distribution of the primary cross-correlation metric across the array, and investigate whether systematics are affecting its statistics.
In Section 3 we introduce four auto-correlation metrics, explaining their necessity, describing their precise statistical formulation, and giving examples of typical and atypical antennas. Finally, in Section 4 we summarize our methods and results.

[Figure 1 caption fragment: (Dillon & Parsons, 2016; DeBoer et al., 2017). Only actively instrumented antennas are drawn; many more dishes had been built by this point.]

Flagging of misbehaving antennas is necessary to prevent them from impacting calibration, imaging, or power spectrum calculation steps. Here we define a misbehavior to be any feature which makes an antenna unusual when compared to others. In practical terms, the pathologies of antenna malfunction are not limited to the signal chain at the antenna, but could manifest anywhere in the system up to the output of the correlator. Depending on where along the signal chain the pathology lies, we might see evidence of it in either the auto-correlations, the cross-correlations, or both. For example, if an antenna's timing was out of sync with another's, its auto-correlations might look fine, but its cross-correlations would highlight this systematic. In particular, as an interferometric array grows in size, it is vital to track the health of the entire array, not just the auto-correlations or the cross-correlations in isolation. In Section 2.1 we define a new cross-correlation metric that is aimed at quantifying how well each antenna is correlating with the rest of the array, and we validate this metric with a simulation. Next, in Section 2.2 we utilize this correlation metric to identify cross-polarized antennas. Finally, in Section 2.3 we outline our specific algorithm for identifying and removing problematic antennas using the cross-correlation metric framework. Our most generalized metric for assessing antenna function tests how well antennas correlate with each other. There are many reasons antennas might not correlate: one of the gain stages might be broken, cables might be hooked up incorrectly, or an antenna might not be phase-aligned with other functional antennas. Assessment of cross-correlations in uncalibrated data is challenging because the correlations can vary widely depending on the baseline length and sky configuration. In particular, one must be able to tell the difference between baselines that include both the expected sky signal and noise versus baselines that include only noise. A metric which is robust against these and other challenges is the normalized and averaged correlation matrix C_ij:

C_{ij} = \left| \left\langle \frac{V_{ij}^{\rm even} \, V_{ij}^{\rm odd\,*}}{|V_{ij}^{\rm even}| \, |V_{ij}^{\rm odd}|} \right\rangle_{t,\nu} \right|,    (1)

where ⟨⟩_{t,ν} represents an average over time and frequency, V^even_ij and V^odd_ij are pairs of measurements of the same sky with independent noise, and i and j are antenna indices, such that ij represents an individual baseline. This holds for any correlator outputs separated by timescales short enough that the sky will not rotate appreciably, so that we can assume that time-adjacent visibilities are observing the same sky signal but with independent noise realizations.[1] Division by the visibility amplitude in Equation 1 minimizes the impact of very bright RFI that might differ between even and odd visibilities and dominate the statistics. We experimented with alternative statistics like a maximum and a median to compress across time and frequency, but found that with the normalized correlation a simple average was sufficiently robust. Due to our chosen normalization, the correlation metric measures the phase correlation between visibilities, and is unaffected by overall amplitudes.
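To make Equation 1 concrete, the following is a minimal NumPy sketch of the per-baseline metric; the function name and array layout are our own illustrative choices, not the HERA pipeline's API.

```python
import numpy as np

def correlation_metric(v_even, v_odd):
    """Normalized even/odd correlation metric, per Equation 1 (sketch).

    v_even, v_odd : complex ndarrays of shape (Ntimes, Nfreqs)
        Interleaved visibilities for a single baseline (i, j).
    Returns a real scalar: ~0 if the phases are noise-like,
    ~1 if the even and odd visibilities are strongly correlated.
    """
    norm = np.abs(v_even) * np.abs(v_odd)
    # Normalize each sample by its amplitude so bright RFI cannot
    # dominate the average; guard against zero-amplitude samples.
    normalized = np.divide(v_even * np.conj(v_odd), norm,
                           out=np.zeros_like(v_even), where=norm > 0)
    return np.abs(np.mean(normalized))
```

The per-antenna reduction C_i defined below is then simply an average of these values over all baselines that include antenna i.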
If the phases are noise-like, the antennas will be uncorrelated and this value will average down to zero. If V^even_ij and V^odd_ij are strongly correlated, we expect this statistic to be near one. The normalization in Equation 1 is particularly useful in mitigating the effects of RFI and imperfect power equalization between antennas. We can visualize the correlation matrix C_ij with each baseline pair ij as an individual pixel, such that the auto-correlations fall along the diagonal. A schematic of this visualization is shown in Figure 3. [Figure 3 labels: Antenna 10; (4,10); Antenna 10 column mean.] To emphasize any patterns related to electronic connectivity, antennas are organized by their node connection, and within that by their subnode-level electronic connections. Node boundaries are denoted by light blue lines. While the nodal structure used here is specific to HERA, the principle of organizing by electronic connectivity is a generalizable technique for highlighting patterns that may be due to systematics in particular parts of the system. Additionally, plotting the matrices in this way allows us to assess the system health on an array-wide level and on an individual antenna level all in one plot, which is increasingly useful as the size of an array grows. To study the performance of any single antenna it is useful to form a per-antenna cross-correlation metric C_i by averaging over all baselines that include a given antenna:

C_i = \frac{1}{N_{\rm ants} - 1} \sum_{j \neq i} C_{ij},    (2)

where N_ants is the number of antennas. We calculate this metric separately for all four instrumental visibility polarizations: NN, EE, EN, NE. The panels below each matrix in Figure 3 show this per-antenna average correlation metric C_i. Next, Figure 4 shows a visualization of C_ij for all four polarizations, using data from a representative subset of the HERA array for simplicity. Here the values have a bimodal distribution (most obvious in the East-East and North-North polarizations), where most antennas are either showing a consistently low metric value, or are close to the array average. This bimodality is also clear in the lower panels showing the per-antenna metric C_i. Here we see more clearly that there is a fairly stable array-level average metric value for each polarization, with a handful of antennas appearing as outliers. The dashed line in the lower panels shows the threshold that is used for antenna flagging, with the points below the threshold marked in red; see Section 2.3 for more on this. There are three primary features to note in Figure 4. First, we see that antennas 51 and 87 are lower than the array average in the North-North and East-East polarizations, but are higher than average in the other two polarizations.

[1] In HERA's case we are able to utilize our specific correlator output to construct even and odd visibilities that are interleaved on a 100 ms timescale. To explain this, we digress briefly into the output of the HERA correlator. In its last stage of operation, antenna voltage spectra are cross-multiplied and accumulated over 100 ms intervals. These visibilities can be averaged over the full 9.6 second integration before being written to disk. However, in order to improve our estimate of noise and to aid in the estimation of power spectra without a thermal noise bias, we split these 96 spectra into two interleaved groups, even and odd, and sum them independently before writing them to disk. Thus, each is essentially 4.8 seconds of integrated sensitivity, spread over 9.6 seconds of observation.
These points are marked in cyan in the lower panel. The reason for this pathology is that antennas 51 and 87 are cross-polarized, meaning that the cables carrying the East and North polarizations are swapped somewhere along the cable path; this will be discussed further in Section 2.2. Second, we can see that the typical value of C_ij is higher in the East-East polarization than in the North-North polarization. This is because of the elevated signal-to-noise ratio observed in the East-East polarization due to contributions from the galactic plane and diffuse emission. Lastly, we observe that there appears to be a slight increase in the average metric power for baselines within the same node compared to baselines between antennas in different nodes. We explore this effect in the next section. Figure 4 shows that there is a significant amount of structure in the correlation matrices, specifically related to node connections. Baselines within a node appear to have larger values of C_ij than baselines between nodes. We have previously noticed instances of severe node-based structure when there are timing mismatches between nodes due to a failure of the clock distribution system. Figure 5 is an example from an observation when the timing system was known to be broken, and we see clearly that timing mismatches depress the correlation metric. This causes much clearer node structure than the more common structure seen in Figure 4. Therefore, one wonders: are the larger C_ij values on the intra-node baselines due to some milder form of this clock distribution issue (perhaps a small error in timing), or is this structure otherwise explicable or even expected? Put another way, what is the expectation value of C_ij as defined in Equation (1)? We can make the assumption that V^even_ij = V^odd_ij ≡ V_ij, and that the two measurements differ only by their noise, n_ij, with mean 0 and variance σ²_ij. Ignoring time and frequency dependence, we can then use Equation (1) to first order (ignoring correlations between the numerator and denominator) to find that

\langle C_{ij} \rangle \approx \frac{|V_{ij}|^2}{|V_{ij}|^2 + \sigma_{ij}^2}.    (3)

This approximate expectation value shows us the importance of the signal-to-noise ratio (SNR). At high SNR, C_ij goes to 1, assuming the even and odd signal terms are actually the same, i.e. that the array is correlating. At low SNR, C_ij goes to 0. It follows then that the apparent node-based structure in C_ij might actually be the impact of the relationship between SNR and baseline length. Inspecting the array configuration (see Figure 1) we see that baselines within the same node tend to be shorter than baselines involving two nodes. Shorter baselines are dominated by diffuse galactic synchrotron emission, which means that they tend to have a higher signal than longer baselines. Since all baselines have similar noise levels and since higher SNR leads to larger values of C_ij, this could account for the effect. In order to confirm that our node structure is explicable as a baseline-length effect rather than some other systematic, we can implement a simple simulation with thermal noise. We calculate V^true_ij from our data as (V^even_ij + V^odd_ij)/2, and take this as a reasonable stand-in for the sky signal, in lieu of a more sophisticated simulation, since it should have approximately the right relative power and should largely average out the instrumental noise. To each visibility V^true_ij we then add independent Gaussian-distributed thermal noise, with variance given by the radiometer equation,

\sigma_{ij}^2 = \frac{V_{ii} \, V_{jj}}{\Delta t \, \Delta \nu},    (4)

where ∆t is the integration time and ∆ν is the channel width. This noise is uncorrelated between baselines, times, and frequencies.
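A minimal sketch of this noise simulation, assuming the radiometer-equation form of Equation 4 above; the function name and the placeholder values in the usage comment are ours, not the paper's:

```python
import numpy as np

def simulate_noisy_visibility(v_true, auto_i, auto_j, dt, dnu, rng):
    """Add Gaussian thermal noise to an estimated true visibility (sketch).

    v_true : complex ndarray (Ntimes, Nfreqs), e.g. (V_even + V_odd) / 2
    auto_i, auto_j : real autocorrelation waterfalls of antennas i and j
    dt, dnu : integration time [s] and channel width [Hz]
    """
    sigma = np.sqrt(auto_i * auto_j / (dt * dnu))  # Equation 4
    # Split the variance evenly between real and imaginary parts
    noise = (rng.normal(scale=sigma / np.sqrt(2))
             + 1j * rng.normal(scale=sigma / np.sqrt(2)))
    return v_true + noise

# Drawing two independent realizations mimics the even/odd split
# (dt and dnu below are illustrative placeholders, not HERA's values):
# rng = np.random.default_rng(0)
# v_even_sim = simulate_noisy_visibility(v_true, auto_i, auto_j, 4.8, 1e5, rng)
# v_odd_sim = simulate_noisy_visibility(v_true, auto_i, auto_j, 4.8, 1e5, rng)
```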
We then calculate C_ij from the simulated visibilities. We compare the C_ij with simulated noise to the observed C_ij in Figure 6. [Figure 5 caption fragment: This is a sample case where the auto-correlations are nominally acceptable, and investigation of the cross-correlations is necessary to see this type of failure mode.] [Figure 6 panel titles: Data-Based Calculation; Gaussian Noise-Based Calculation.] We can see clearly that the node-based structure we observed in the original correlation matrices is completely reproduced when using a Gaussian noise estimate. This conclusion helps confirm that the apparent node-based structure in C_ij is driven by sky feature amplitude, which sets the SNR, rather than by systematics. Finally, in Figure 7 we confirm that our metric distribution is representative of the sky by plotting C_ij versus baseline length for all four polarizations using real sky data. We color each baseline by whether both constituent antennas were unflagged (blue), at least one was flagged for having a low correlation metric (red), or at least one was flagged for being cross-polarized (cyan). We clearly see the smooth distribution we would expect from sky features, with clearly distinguishable sub-groups by flagging categorization. We would expect a power-law slope for galactic emission with strong variation as a function of baseline azimuthal angle, while the point source component should be independent of baseline length or angle, and noise should be similar to point sources (Byrne et al., 2021). Notably, the nominally good antennas generally follow this pattern, with a strong increase towards shorter baselines. Additionally, baselines observing the North-East and East-North polarizations of the sky show a potential transition between galactic domination to point source or noise domination around 100 meters. At frequencies near the middle of the HERA band this corresponds to 1.5 degrees, which is roughly the scale at which point sources are commensurate with galactic emission (Byrne et al., 2021). Given the significant agreement between our measured and expected distributions of C_ij, we are confident in our conclusion that the observed structure in Figure 4 is driven by sky features rather than instrumental systematics. As we have already seen in passing, the correlation metric C_ij clearly identifies cross-polarized antennas. Here, cross-polarized means that the physical cables carrying the East and North polarization measurements got swapped in the field. When things are hooked up correctly, we expect to see a stronger correlation between matching polarizations (i.e. EE and NN), and a weaker correlation between different polarizations. Cross-polarized antennas have the opposite situation, with stronger correlation in EN and NE.
We identify this situation automatically with a cross-polarization metric formed from the difference between four polarization combinations of the per-antenna correlation metric:

C_i^{P - P_\times} = C_i^{P} - C_i^{P_\times},    (5)

where P is either the EE or NN polarization, and P_× is either the NE or EN polarization. We then calculate our cross-polarization metric as the maximum over the four combinations of same-polarization and opposite-polarization visibilities:

R_i = \max_{P, P_\times} \left( C_i^{P} - C_i^{P_\times} \right).    (6)

We take the maximum because it is possible to get negative values for some of the C_i^{P−P_×} when one polarization is dead and the other is not. However, when all four values are negative (i.e., a negative maximum), then the antenna is likely cross-polarized. In Figure 8 we show each of the four differences of C_i. Two antennas, 51 and 87, show negative values in all four combinations, indicating swapped cables. Three other antennas (37, 38, and 101) show up negative in two polarizations, which indicates a single dead polarization rather than a swap. Using our correlation metric C_i defined in Equation 2 and our cross-polarization statistic R_i defined in Equation 6, we can implement an iterative algorithm to flag and remove broken and cross-polarized antennas. In Figure 4 we clearly saw that dead antennas have a value of C_i very near zero. As a result, when we calculate C_i for functional antennas by averaging over all constituent baselines, the low correlation between a functional and a dead antenna will decrease the overall value of C_i for the functional antenna. In the case where only a couple of antennas are broken among the whole array this may be tolerable, but it is possible for this bias to cause functional antennas to look much worse than they are, and to potentially drop below the flagging threshold. To prevent the expected value of our metric from being biased by dead antennas, we implement an iterative metric calculation and flagging approach, outlined in Algorithm A1. First, we calculate C_i for all antennas, identify any that are completely dead (i.e. C_i = 0), and remove them. Then we recalculate C_i and R_i for all antennas, and identify and remove the worst antenna if it falls below the threshold. We continue with this recalculation and reassessment of the metrics until all remaining antennas are above the threshold in both metrics. Figure 9 shows a comparison between the values of C_i calculated by directly averaging C_ij for each antenna versus using the iterative algorithm. We see clearly from this figure that implementing an iterative approach brings our data into a truly bimodal realm where establishing a threshold is straightforward. Based on the observed values, we set an empirical threshold of C_i = 0.4, such that any antennas below that value will be flagged and removed. Note that the two antennas marked in cyan are both cross-polarized, so their value near the threshold is not worrisome. As noted in Section 2.2, these points are flagged for having a maximum value of R_i below zero. This iterative approach to flagging is robust against varying proportions of broken antennas, which is essential for flagging during the commissioning phase of an array. While the iterative approach somewhat increases computation time, we find the tradeoff to be worthwhile. Even with the iterative approach, flagging based on the cross-correlation metrics scales at worst with the number of visibilities, which we find reasonable. Our most computationally expensive step is simply reading in all of the data.
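The following sketch illustrates the iterative loop of Section 2.3 in simplified form; the dictionary-based data layout and function names are our own illustrative choices, not the HERA pipeline's, and the cross-polarization check is reduced to its essentials.

```python
import numpy as np

def iterative_flagging(corr, ants, pols=("EE", "NN"), threshold=0.4):
    """Iteratively flag antennas using C_i and R_i (Section 2.3 sketch).

    corr : dict mapping (i, j, pol) -> time/frequency-averaged C_ij,
           with pol in ("EE", "NN", "EN", "NE"); our assumed layout.
    Returns the set of flagged antennas.
    """
    good = set(ants)

    def C_i(i, pol):
        return np.mean([corr[i, j, pol] for j in good if j != i])

    def R_i(i):
        # Maximum same-pol minus cross-pol difference (Equation 6)
        return max(C_i(i, p) - C_i(i, px)
                   for p in ("EE", "NN") for px in ("EN", "NE"))

    # First pass: remove completely dead antennas (C_i = 0)
    good -= {i for i in set(good) if all(C_i(i, p) == 0 for p in pols)}

    # Then iteratively remove one antenna at a time, recomputing the
    # metrics over the remaining antennas after each removal
    while True:
        xpol = [i for i in good if R_i(i) < 0]  # cross-polarized
        if xpol:
            good.remove(xpol[0])
            continue
        metrics = {i: min(C_i(i, p) for p in pols) for i in good}
        worst = min(metrics, key=metrics.get)
        if metrics[worst] >= threshold:
            break
        good.remove(worst)
    return set(ants) - good
```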
In the case where reading in the data becomes computationally prohibitive in a real-time pipeline, we may take advantage of the time-stability of the correlation metric and calculate antenna flags on a sparser time interval. As it is computationally infeasible to hold data for all baselines over the whole night in memory at once, we reserve time-domain data quality assessments for auto-correlations only, as discussed in Section 3. It is also relevant to note that we only use the North-North and East-East polarizations during antenna flagging. There are multiple reasons for this. First, we see in Figure 7 that the expected value of C_ij is higher in same-polarization correlations compared to cross-polarization correlations. The larger separation in expected value between functional and dead antennas leads to a more robust flagging threshold. Second, we have no evidence in HERA data to indicate the existence of systematics that appear only in correlations between different polarizations. Therefore, flagging only on the same-polarization correlations allows for a clearer distinction between functional and broken antennas without missing any known failure modes of the system. While the correlation metrics provide an absolute check on the data quality of a particular antenna, not all effects will be caught by this approach. For example, if one antenna has a bandpass structure completely unlike the rest (an effect that might be calibratable), it is useful to identify it and flag it as a symptom of some deeper malfunction in the array. It is useful, therefore, to assess antennas for ways in which they deviate from others, assuming that the plurality of antennas will be well-behaved. Identification of misbehavior is more difficult with a new system. A newly-built telescope system with novel combinations of technologies means that we lack an a priori model for how signal chains might malfunction. In early commissioning we observed broadband temporal and spectral instabilities in visibilities, which motivated a metric that examines whole nights of data. We choose to focus on auto-correlations V_ii for two reasons. The first is data volume. The number of auto-correlations scales with N_ant while the number of visibilities scales with N_ant², far too big to load into memory at once for a whole night of data. Second, because our goal is to identify malfunctioning antennas before calibration, we focus on auto-correlations because they are easier to compare without calibration. Comparison between visibilities measuring the same baseline separation requires at minimum a per-antenna delay calibration to flatten phases. That term cancels out in the auto-correlations, leaving each V^obs_ii ∝ |g_i|² V^true_auto. Since most bandpass gains should be similar, auto-correlations can be sensibly compared to one another to look for outliers before calibrating. Even if |g_i|² differs between antennas, that is something we would like to know and perhaps rectify in the field. Historically, auto-correlations from radio interferometers are seldom used. For example, at the VLA the autos are usually discarded (Taylor & Rupen, 1999). The usual reasons given for this are that auto-correlations have a noise bias and that gain variations are assumed to not correlate between antennas. However, given HERA's sensitivity to calibration stability, this assumption is worth re-considering.
Recently, other collaborations have also begun exploring auto-correlations as a valuable tool for assessing data quality (Rahimi et al., 2021) and performing calibration (Barry et al., 2019). Each antenna's auto-correlation stream can be reduced statistically across an entire observation to a single metric spectrum which can then be quickly compared to all other spectra to search for outliers. For HERA, a drift-scan telescope which operates continuously each night for months at a time, one full night's observation is a useful averaging time range. We focus on four factors motivated by antenna failure modes noted in manual inspection of hundreds of antenna-nights of auto-correlation data: bandpass shape (Section 3.1), overall power level (Section 3.2), temporal variability (Section 3.3), and temporal discontinuities (Section 3.4). The purpose of this section is to develop quantitative metrics that capture these qualitative concerns in a rigorous way, attempting to reduce antenna "badness" along each of these dimensions to a single number. In Section 3.5 we show how these four statistics together produce a useful summary of per-antenna performance (see Figure 15). Each of these four statistics comes in two flavors. The first is a median-based statistic which is more robust against transient or narrow-band outliers in each time vs. frequency plot or "waterfall", like RFI. The second is a more sensitive mean-based statistic. Our basic approach, outlined in pseudocode in Algorithm A2, is to remove the worst antennas with the robust statistics, then flag RFI, then flag the more subtly bad antennas with the mean-based statistics. In the following sections, we offer a more precise definition of the calculations and the algorithmic application. Our first metric is designed to identify and flag antennas with discrepant bandpass structures. This often indicates a problem in the analog signal chain. As we mention in Algorithm A2, we first reduce the auto-correlation waterfall for antenna i, polarization p to a single spectrum S(ν) as follows:

S_{i,p}^{\rm med}(\nu) = \frac{{\rm med}\{V_{ii,pp}(\nu, t)\}_t}{{\rm med}\{V_{ii,pp}(\nu, t)\}_{t,\nu}},    (7)

where med{}_t indicates a median over time while med{}_{t,ν} indicates a median over both time and frequency. This gives us a notion of the average bandpass shape while normalizing the result to remove differences between antennas due to overall power. The reduction from waterfall to spectrum only needs to be computed once per antenna. We can now compute the median difference between each antenna's spectrum and the median spectrum with the same polarization p according to the following formula:

D_{i,p}^{\rm med} = {\rm med}\left\{ \left| S_{i,p}^{\rm med}(\nu) - {\rm med}\{S_{j,p}^{\rm med}(\nu)\}_j \right| \right\}_\nu,    (8)

where j indexes over all unflagged antennas. To determine which antenna to flag, if any, we convert each D^med_{i,p} into a modified z-score by comparing it to the overall distribution of distances. These modified z-scores are defined as

z_{i,p}^{\rm mod} = \sqrt{2}\,{\rm erf}^{-1}(0.5) \, \frac{D_{i,p}^{\rm med} - {\rm med}\{D_{j,p}^{\rm med}\}_j}{{\rm MAD}\{D_{j,p}^{\rm med}\}_j},    (9)

where MAD{}_j is the median absolute deviation over antennas and erf⁻¹(x) is the inverse error function. The factor of √2 erf⁻¹(0.5) ≈ 0.6745 normalizes the modified z-score so that the expectation value of a z^mod of a sample drawn from a Gaussian distribution is the same as its standard z-score.[4] Having computed modified z-scores for every antenna and every polarization, we iteratively remove the antenna with the worst modified z-score over all metrics and both polarizations. When one polarization is flagged, we flag the whole antenna. We then recompute D^med_{i,p} and z^mod_{i,p} and continue flagging antennas until none have a modified z-score over a chosen threshold, in our case 8.0.
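A sketch of these three steps in NumPy (Equations 7 through 9 above); the function names and array layouts are illustrative choices of ours:

```python
import numpy as np
from scipy.special import erfinv

def shape_spectrum(waterfall):
    """Median-based bandpass shape spectrum (Equation 7 sketch).
    waterfall : real ndarray (Ntimes, Nfreqs) of autocorrelations."""
    return np.median(waterfall, axis=0) / np.median(waterfall)

def shape_distances(spectra):
    """Median distance of each spectrum from the median spectrum
    over antennas (Equation 8 sketch).
    spectra : ndarray (Nants, Nfreqs)."""
    med_spec = np.median(spectra, axis=0)
    return np.median(np.abs(spectra - med_spec), axis=1)

def modified_z(distances):
    """Modified z-scores over antennas (Equation 9 sketch)."""
    med = np.median(distances)
    mad = np.median(np.abs(distances - med))
    # sqrt(2) * erfinv(0.5) ~= 0.6745 rescales the MAD so that a
    # Gaussian sample's modified z matches its standard z-score
    return np.sqrt(2) * erfinv(0.5) * (distances - med) / mad
```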
All subsequent metrics use the same threshold for median-based flagging. Next we perform a simple RFI flagging, analogous to the algorithm used in HERA Collaboration et al. (2021), but performed on a single auto-correlation waterfall averaged over all remaining antennas. This process includes a search for local outliers after median filtering and then mean filtering, which are flagged as RFI. Finally, a thresholding algorithm is performed that throws out entire channels or entire integrations which are themselves significant outliers after analogous 1D filtering. The results of this process are shown in Figure 10. This process flags 12.6% of the data, excluding band edges, and leaves 11.3% of channels and 1.0% of all times completely flagged. This is likely an under-count of RFI; the algorithm is designed to flag the most egregious outliers that might skew the statistics described below, rather than to find and remove RFI for the purpose of making sensitive 21 cm power spectrum measurements. After RFI flagging, we next compute shape metric spectra with mean-based statistics. Analogously to Equation 7, in this case

S_{i,p}^{\rm mean}(\nu) = \frac{\langle V_{ii,pp}(\nu, t) \rangle_t}{\langle V_{ii,pp}(\nu, t) \rangle_{t,\nu}},    (10)

where ⟨⟩_t indicates a weighted mean over the time dimension, giving zero weight to times and frequencies flagged for RFI. Likewise, these spectra are reduced to scalar distance metrics as

D_{i,p}^{\rm mean} = \left\langle \left| S_{i,p}^{\rm mean}(\nu) - \langle S_{j,p}^{\rm mean}(\nu) \rangle_j \right| \right\rangle_\nu,    (11)

where again averages are performed over unflagged antennas, times, and frequencies. Just as before, we compute modified z-scores to iteratively flag the worst antenna outlier, recalculating D^mean_{i,p} after each antenna is flagged. This proceeds until no antennas exceed a z-score of 4, half the threshold used during the first-round median cut. Again, this threshold is the same for mean-based flagging in all subsequent metrics.

[4] Were the distribution of distance metrics Gaussian (it is generally not), then one could think of a modified z-score of 8 as an "8σ outlier." This kind of language is imprecise, but often useful for building intuition.

In Figure 11 we show the results of this operation with example waterfalls and metric spectra for antennas that were and were not flagged by our modified z-score cut of 4.0. In general, we find that the metric robustly identifies antennas with metric spectra discrepant from the main group of antennas. Almost everything in red in Figure 11 is a pretty clear outlier. Where exactly to draw the line is tricky, and likely requires some manual inspection of metric spectra and waterfalls for antennas near the cutoff. Note that this figure includes flagging by all four metrics. Some moderate outliers in shape were not flagged for shape but were flagged for other reasons, indicating that this metric and the other three discussed below are not completely independent. We next turn to looking for outliers in bandpass power. High power might indicate incorrect amplifier settings, while a signal chain malfunction might cause anomalously low power. Our approach for finding outliers in power is very similar to the one for finding outliers in bandpass shape laid out in Section 3.1. Here we lay out the mathematical approach, highlighting and motivating differences between the two. Once again, we begin by defining median-based metric spectra which collapse each antenna's waterfall down to a single number per frequency. For bandpass power, that is simply

S_{i,p}^{\rm med}(\nu) = {\rm med}\{V_{ii,pp}(\nu, t)\}_t.    (12)

This is simply an unnormalized version of Equation 7.
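As a sketch, this unnormalized power spectrum and the logarithmic distance used to compare antennas (the motivation for comparing logarithms is given immediately below) might look like the following; names are illustrative:

```python
import numpy as np

def power_spectrum(waterfall):
    """Median-based bandpass power spectrum (Equation 12 sketch)."""
    return np.median(waterfall, axis=0)

def power_distance(spec, all_specs):
    """Median absolute log-distance from the median spectrum (sketch).
    Gains are multiplicative, so antennas are compared in log space,
    penalizing both too little and too much power."""
    med_spec = np.median(all_specs, axis=0)
    return np.median(np.abs(np.log10(spec) - np.log10(med_spec)))
```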
However, instead of directly comparing each antenna's spectrum with the median spectrum, we instead compare their logarithms:

D_{i,p}^{\rm med} = {\rm med}\left\{ \left| \log_{10} S_{i,p}^{\rm med}(\nu) - \log_{10} {\rm med}\{S_{j,p}^{\rm med}(\nu)\}_j \right| \right\}_\nu.    (13)

This logarithmic distance measure reflects the fact that gains are multiplicative and that the optimal ranges for amplification and digitization are themselves defined in decibels. We take the absolute value of the difference of the logs because we want to penalize both antennas with too little power, which may indicate a malfunction, and antennas with too much power, which may cause a nonlinear response to the sky signal. After RFI flagging as described in the previous section, we next proceed with outlier detection using mean-based statistics, which are straightforward adaptations of Equations 12 and 13:

S_{i,p}^{\rm mean}(\nu) = \langle V_{ii,pp}(\nu, t) \rangle_t,    (14)

D_{i,p}^{\rm mean} = \left\langle \left| \log_{10} S_{i,p}^{\rm mean}(\nu) - \log_{10} \langle S_{j,p}^{\rm mean}(\nu) \rangle_j \right| \right\rangle_\nu.    (15)

Once again, as we can see in Figure 12, this metric picks out a number of antennas that are clearly behaving differently than the main group. As we saw in the previous section, there are some antennas which appear to be "in family" according to this metric but are flagged for other reasons. But now we can start to see why this might be. A few of the flagged antennas appear to be fine according to their bandpass shape but are significantly lower or higher in power than the rest. We now turn to the question of searching for outliers in the temporal structure of the antenna response. While the metrics follow a similar pattern (median-based spectra and distances, followed by mean-based spectra and distances), they are mathematically quite different from those in Sections 3.1 and 3.2. During observing and subsequent inspection analysis, sharp discontinuities were observed in the auto-correlations. Often, though not always, these are rapid changes occurring within a single integration. Sometimes they are accompanied by apparent changes in the bandpass shape or power. Sometimes the effects are relatively localized in frequency; sometimes they are broadband. Sometimes they are frequent jumps; sometimes there are just a handful of discontinuities followed by minutes or hours of stability. Developing a physical understanding of the origin of these effects is an ongoing research effort outside the scope of this paper. Absent that understanding (and a hardware fix to prevent the effects), we have to consider this behavior suspicious and therefore meriting flagging. Here and in Section 3.4 we present two metrics for automatically identifying temporal effects. In general, we are looking for forms of temporal structure of the auto-correlations that cannot be explained by the sky transiting overhead. The first looks for high levels of temporal variability throughout the night. To distinguish temporal variability due to sky rotation from anomalous temporal structure, our metrics are based on a comparison of each antenna's auto-correlation waterfall with an average waterfall over all antennas. For our first round of median statistics, we use the median absolute deviation of the waterfall along the time axis after dividing out the median waterfall over antennas. Thus, we compute

S_{i,p}^{\rm med}(\nu) = \frac{1}{\sqrt{2}\,{\rm erf}^{-1}(0.5)} \, {\rm MAD}\left\{ \frac{V_{ii,pp}(\nu, t)}{{\rm med}\{V_{jj,pp}(\nu, t)\}_j} \right\}_t    (16)

to produce a single spectrum for each antenna that can be reasonably interpreted as the standard deviation over time of each channel with respect to the mean over time.
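In code, this median-round variability spectrum might look like the following sketch of Equation 16 (the 0.6745 factor is √2 erf⁻¹(0.5); names are illustrative):

```python
import numpy as np

def variability_spectrum(waterfall, median_waterfall):
    """Median-based temporal variability spectrum (Equation 16 sketch).

    waterfall : one antenna's autocorrelation, ndarray (Ntimes, Nfreqs)
    median_waterfall : element-wise median over all antennas' waterfalls
    """
    ratio = waterfall / median_waterfall
    # MAD over time, rescaled so a Gaussian channel yields its std. dev.
    mad_t = np.median(np.abs(ratio - np.median(ratio, axis=0)), axis=0)
    return mad_t / 0.6745
```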
We calculate the distance metric for each antenna by taking the median over frequency of how much the antenna's temporal variability metric spectrum exceeds the median metric spectrum over all antennas:

D_{i,p}^{\rm med} = {\rm med}\left\{ S_{i,p}^{\rm med}(\nu) - {\rm med}\{S_{j,p}^{\rm med}(\nu)\}_j \right\}_\nu.    (17)

Note that we do not take the absolute value of the difference; while shape and power mismatches are penalized both for being too low and for being too high, we do not penalize antennas for varying less than the median. These simply become negative z-scores, indicating that an antenna has less temporal variation than the median signal, and do not result in flags. Our mean-based metrics are a straightforward adaptation of Equations 16 and 17:

S_{i,p}^{\rm mean}(\nu) = \sigma_t\left\{ \frac{V_{ii,pp}(\nu, t)}{\langle V_{jj,pp}(\nu, t) \rangle_j} \right\},    (18)

D_{i,p}^{\rm mean} = \left\langle S_{i,p}^{\rm mean}(\nu) - \langle S_{j,p}^{\rm mean}(\nu) \rangle_j \right\rangle_\nu,    (19)

where σ_t{} denotes a standard deviation over the time axis. In theory, the denominator of Equations 16 and 18 should change each time an antenna is thrown out and the distance measures and modified z-scores are recomputed. This can be computationally expensive when a large fraction of the array needs flagging, as has sometimes been the case during HERA commissioning. In practice, we take a shortcut. During the median-statistics round, we simply neglect this effect, relying on the fact that the median statistics are relatively insensitive to the set of antennas that are flagged. During the next round using mean-based statistics, we iteratively remove antennas until no antennas remain above our modified z-score cut. Only then do we recompute the metric spectra in Equation 18. In general, this has the effect of making the metric spectra more sensitive to temporal variability, since the mean spectrum will include fewer anomalously variable antennas. The standard procedure of removing antennas and recalculating each D^mean_{i,p} (but not each S^mean_{i,p}(ν)) is repeated. This loop continues until no more antennas are flagged after recalculating S^mean_{i,p}(ν) one final time. As a brief aside, we present Table 1, which shows the number of antennas flagged by each metric at each step. The table shows that the power and shape metrics are relatively bimodal, in that the vast majority of antennas flagged by those metrics were bad enough to be flagged by the median-based statistics, and very few antennas required the more sensitive iterative approach. In contrast, we see that antennas flagged by the temporal variability and temporal discontinuity metrics (the latter outlined in the next section) have a more gradual distribution of badness, rendering the mean-based iterative flagging step all the more necessary.

[Table 1 caption fragment: ...auto metrics. Note that each antenna can be flagged by multiple metrics, so the total number of antennas flagged per round is less than the sum of flags per metric. Additionally, antennas may be flagged for one reason during the median-based round and another during the mean-based round, which explains why the total for each metric can possibly exceed the total for the whole round.]

In Figure 13 we show the resulting mean-based metric spectra after iteratively removing outliers. While there are some very clear outliers that are successfully identified, the precise line between what should be considered good and what should be considered bad is ambiguous. Clearly the pathology seen in Antenna 110 is worthy of flagging, and the metric successfully identifies it as having high variability relative to the average waterfall. Likewise, most of what is identified as good appears to be behaving like most of the other antennas. Just as with the previous metrics, some level of inspection of antennas near the cutoff is warranted. Though a range of temporal variation pathologies were noted during the observing and data inspection phase, one that stood out was abrupt changes occurring faster
than the integration time and lasting minutes to hours. Our second metric for anomalous temporal structure looks for such sharp discontinuities, which also cannot be explained by sky rotation. As with our metric for overall temporal variability (see Section 3.3), our metric is based on examining each antenna's waterfall after dividing out the average waterfall of unflagged antennas. Instead of using the median absolute deviation or the standard deviation, which are measures of variability on any timescale, we instead want to detect variability on the shortest timescale, which is the hardest to explain with antenna-to-antenna primary beam variations. Beginning with the auto-correlation scaled by the median over antennas, denoted Ṽ_{i,p}(ν,t), we compute the discrete difference along the time axis, and then collapse that waterfall (which is only one integration shorter than the original) along the time axis to a metric spectrum. In our first round of flagging using median statistics, this becomes:

S_{i,p}^{\rm med}(\nu) = {\rm med}\left\{ \left| \tilde{V}_{i,p}(\nu, t + \Delta t) - \tilde{V}_{i,p}(\nu, t) \right| \right\}_t,    (20)

where ∆t is our integration time (9.6 s in this data set). Our distance measure, designed to penalize only excessive levels of temporal discontinuities, is the same as in Equation 17:

D_{i,p}^{\rm med} = {\rm med}\left\{ S_{i,p}^{\rm med}(\nu) - {\rm med}\{S_{j,p}^{\rm med}(\nu)\}_j \right\}_\nu.    (21)

The adaptation to mean-based statistics is straightforward:

S_{i,p}^{\rm mean}(\nu) = \left\langle \left| \tilde{V}_{i,p}(\nu, t + \Delta t) - \tilde{V}_{i,p}(\nu, t) \right| \right\rangle_t,    (22)

D_{i,p}^{\rm mean} = \left\langle S_{i,p}^{\rm mean}(\nu) - \langle S_{j,p}^{\rm mean}(\nu) \rangle_j \right\rangle_\nu.    (23)

In Figure 14 we show the resulting metric spectra, which display a variety of strange behavior: some antennas show broadband effects, others are more localized. Antenna 89 and one other even show spectrally oscillatory levels of temporal discontinuities; we currently have no explanation for this effect. Perhaps these features provide further clues to the ongoing system integration and debugging efforts. The good antennas are fairly tightly clustered around the average, which is spectrally flat. That behavior is expected if the integration-to-integration differences are purely attributable to thermal noise. Normalizing each waterfall by the average good waterfall should cancel out the spectral and temporal dependence of the noise. Given that theoretical expectation, this might be the easiest of all the metrics for which to set an absolute cut, rather than a relative one based on the modified z-score. However, the wide variety of poorly-understood malfunctions, combined with the possibility that low-level RFI might still contaminate these metrics, complicates that picture. One advantage of the auto-correlation metrics framework is that it is straightforwardly applicable to new combinations of metric spectra and distance measures. For example, it should be noted that the anomalous temporal structure metrics in Sections 3.3 and 3.4 are not exhaustive. By averaging over the whole night, they privilege frequent or persistent effects over infrequent ones. For example, a strong jump in the waterfall like we see in Antenna 110 in Figure 13 that then quickly reverts to "standard" behavior and does not repeat might not be caught by either metric. One could imagine other ways of computing S(ν) or D that up-weight rare excursions from normality. While we continue to assess antenna malfunctions and develop other metrics, it is worthwhile to continue the visual inspection of auto-correlation waterfalls normalized by the average of nominally good antennas to identify other modalities of malfunction.
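For concreteness, a sketch of the discontinuity metric spectrum (Equations 20 and 22 above), again with illustrative names:

```python
import numpy as np

def discontinuity_spectrum(waterfall, median_waterfall, use_mean=False):
    """Temporal-discontinuity metric spectrum (Equations 20/22 sketch).

    Takes absolute integration-to-integration differences of the
    antenna waterfall after dividing out the array-averaged waterfall,
    then collapses over time with a median (or mean, for round two).
    """
    ratio = waterfall / median_waterfall
    jumps = np.abs(np.diff(ratio, axis=0))  # one integration shorter
    return np.mean(jumps, axis=0) if use_mean else np.median(jumps, axis=0)
```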
In particular, we find it useful to produce a suite of per-antenna visualizations of the different metric spectra and the corresponding auto-correlation waterfalls. In Figure 15 we show three such examples: one clearly malfunctioning (Antenna 0), one nominal (Antenna 85), and one borderline case that we ultimately flagged (Antenna 24). For each, we show their metric spectra compared to all unflagged antennas, along with the z-scores, highlighting which antennas were automatically flagged. These plots synthesize the information about how discrepant each antenna is along the four axes considered here and help clarify why. In Figure 15 we also show both the waterfalls and the normalized waterfalls, which are divided by the average good waterfall (Figure 10) and then normalized to average to 1. We find it particularly useful to look closely at these normalized waterfalls, especially in borderline cases like Antenna 24. Antenna 24's bandpass shape is sufficiently discrepant with the others to merit an automatic flag, though this does not necessarily mean that it is uncalibratable. More concerning are the abrupt discontinuities at high frequency around JD 2459122.3 and around JD 2459122.5. This is precisely the kind of issue we worried about: a strong but rare temporal feature that just barely misses the threshold. Examples like this motivate by-eye inspection of borderline antennas. This is what we have done with recent HERA data. The automatic pipeline produces Jupyter notebooks with plots like Figure 15 for all antennas, sorting them by the single highest z-score metric. This makes it easy to find the borderline antennas and decide whether to flag them on a case-by-case basis. There are a number of current and upcoming interferometers with hundreds of antennas aiming to reach the extreme dynamic range necessary to detect and characterize the neutral hydrogen signal from the epoch of reionization. Separating that signal from foregrounds four to five orders of magnitude stronger requires both large volumes of data and the swift and reliable identification of malfunctions that adversely affect data quality. In this work, we report on new metrics which sensitively detect various pathologies and reliably classify them, using HERA data as a case study. In some cases, the precise underlying mechanism (e.g. an antenna with swapped cables for its two polarizations) is known. In others, a physical explanation requires lab and field tests that are beyond the scope of this paper. Armed with per-antenna classifications, instrument teams can more effectively triage issues according to their occurrence rate. In HERA's case, by inspecting the nightly analysis and dashboard reports that implement the metrics outlined here, the team can quickly assess the impact of hardware changes. Meanwhile, the definition of metric spectra provides a physically meaningful signature which can be exploited by instrument engineers to identify characteristics like reflections, clipping, interference, and more. The definition of metrics which isolate features of interest, and standard ways of displaying them routinely, is crucial to managing a large array with a small team. As digital and analog systems grow in capability, arrays will continue to grow in antenna count.
Arrays like OVRO-LWA-III (Callister et al., 2019), DSA-2000 (Hallinan et al., 2021), HIRAX (Saliwanchik et al., 2021), CHORD (Vanderlinde et al., 2019), PUMA (Castorina et al., 2020), SKA-Low (Mellema et al., 2013), and more will use hundreds to thousands of elements. Ultimately the maintenance time per antenna imposes a significant design pressure on large arrays. This kind of pressure can also affect arrays with fewer antennas but with more elaborate receivers or wider geographic distributions. A prime example of this regime is the proposed ngVLA (Di Francesco et al., 2019). With 244 antennas distributed across New Mexico, Arizona, and Mexico, along with outriggers extending to VLBA sites across North America and six cryogenic receivers, operation will require careful minimization of maintenance time. Quick identification of subtle systematic errors using semi-automatic systems like the ones we describe here is expected to be essential. In 21 cm cosmology experiments, the reliability and precision of arrays will continue to be the dominant factor affecting sensitivity. Identifying, flagging, and ultimately fixing subtle instrument issues will continue to be the first line of defense. Further work in this area is needed, for example, using simulations to replace the detection of relative outliers with absolute thresholds or to replace iterative flagging with a single analysis step. That said, a system like the one presented here will be necessary for triaging malfunctions and extracting science-quality data to form the basis for future cosmology results.

[Figure 15 caption fragment: ...compared to all good antennas in light green, helping us to easily see whether the antenna is an outlier. We also show the full auto-correlation waterfalls, both raw and as the fractional deviation from the antenna average (Figure 10). The effects detected by our metrics can generally be seen in either the raw or normalized waterfall.]

[Figure A1 flowchart steps: 1. Compute the correlation matrix C_ij for all baselines, averaging over time and frequency. 2. Compute the median correlation metric for each antenna i by taking the median value of C_ij over all other antennas j. 3. Flag antennas with C_i = 0 as completely dead. 4. (Re)compute the per-antenna average cross-correlation metric C_i and the per-antenna cross-polarization metric R_i, averaging over all unflagged antennas. 5. Flag the worst remaining antenna if it falls below either threshold; repeat from step 4 until all remaining antennas pass both metrics.]

[Figure A2 flowchart steps: 1. Start: compute all four median-based metric spectra S_{i,p}(ν) from each auto-correlation waterfall V_{ii,pp}(ν,t). 2. Compute all four median metric spectra over all antennas (separately for both polarizations). 3. Compute median-based distances D_{i,p} between each spectrum and the corresponding median spectrum. 4. Compute a modified z-score for each metric, antenna, and polarization. 5. Flag the antenna with the single largest modified z-score over all metrics and both polarizations. 6. Are any antenna z-scores above the round 1 threshold? (Yes: recompute distances and repeat. No: continue.) 7. Average together all auto-correlation waterfalls over unflagged antennas and polarizations. 8. Flag outlier pixels, channels, and integrations with XRFI. 9. Compute all four mean-based metric spectra S_{i,p}(ν) from each auto-correlation waterfall V_{ii,pp}(ν,t). 10. Compute all four mean metric spectra over all antennas (separately for both polarizations). 11. Compute mean-based distances D_{i,p} between each spectrum and the corresponding mean spectrum. 12. Compute a z-score for each metric, antenna, and polarization. 13. Flag the antenna with the single largest z-score over all metrics and both polarizations. 14. Are any antenna z-scores above the round 2 threshold? (Yes: recompute distances and repeat. No: continue.) 15. Have any antennas been thrown out since the temporal metric spectra were last recomputed? (Yes: recompute all mean-based metric spectra S_{i,p}(ν) for temporal variability and temporal discontinuities, and repeat the mean-based round. No: done.)]
Figure A2. Pseudocode flowchart of the auto-correlation metrics flagging algorithm, as discussed in Section 3.

References:
The FHD/eppsilon Epoch of Reionisation power spectrum pipeline. Publications of the Astronomical Society of Australia.
First season MWA EoR power spectrum results at redshift 7.
Van Vleck correction generalization for complex correlators with multilevel quantization.
Foregrounds for observations of the cosmological 21 cm line.
Jacobs, D. C. (2021). A map of diffuse radio emission at 182 MHz to enhance Epoch of Reionization observations in the southern hemisphere.
A first search for prompt radio emission from a gravitational-wave event.
Packed Ultra-wideband Mapping Array (PUMA): Astro2020 RFI Response. arXiv e-prints.
Dual polarization measurements of MWA beampatterns at 137 MHz.
Systematic effects in LOFAR data: A unified calibration strategy.
Hydrogen Epoch of Reionization Array (HERA).
The Next Generation Very Large Array.
Redundant-baseline calibration of the Hydrogen Epoch of Reionization Array.
Redundant Array Configurations for 21 cm Cosmology.
The Hydrogen Epoch of Reionization Array Dish. II. Characterization of Spectral Structure with Electromagnetic Simulations and Its Science Implications.
First limits on the 21 cm power spectrum during the Epoch of X-ray heating.
Understanding the HERA Phase I receiver system with simulations and its impact on the detectability of the EoR delay power spectrum.
Cosmology at low frequencies: The 21 cm transition and the high-redshift universe.
The DSA-2000: A Radio Survey Camera.
First results from HERA Phase I: Upper limits on the Epoch of Reionization 21 cm power spectrum.
Calibration and 21-cm power spectrum estimation in the presence of antenna beam variations.
Absolute Calibration Strategies for the Hydrogen Epoch of Reionization Array and Their Impact on the 21 cm Power Spectrum.
Mitigating Internal Instrument Coupling for 21 cm Cosmology. I. Temporal and Spectral Modeling in Simulations.
Mitigating Internal Instrument Coupling for 21 cm Cosmology. II. A Method Demonstration with the Hydrogen Epoch of Reionization Array.
Data analysis for precision 21 cm cosmology.
Reionization and the Cosmic Dawn with the Square Kilometre Array.
Reionization and Cosmology with 21-cm Fluctuations.
White Rabbit: Sub-nanosecond timing distribution over Ethernet.
Calibrating CHIME: a new radio interferometer to probe dark energy. Ground-based and Airborne Telescopes V.
The GMRT Epoch of Reionization experiment: a new upper limit on the neutral hydrogen power spectrum at z ∼ 8.6.
The Precision Array for Probing the Epoch of Re-ionization: Eight Station Results.
Design and characterization of the Large-aperture Experiment to Detect the Dark Age (LEDA) radiometer systems.
21 cm cosmology in the 21st century.
Epoch of Reionization power spectrum limits from Murchison Widefield Array data targeted at EoR1 field.
An improved source-subtracted and destriped 408 MHz all-sky map.
Multifrequency analysis of 21 centimeter fluctuations from the era of reionization.
Implementation of the Van Vleck correction for the MWA.
Synthesis Imaging in Radio Astronomy II.
The Murchison Widefield Array: The Square Kilometre Array Precursor at Low Radio Frequencies. PASA, 30.
LOFAR: The LOw-Frequency ARray.
The Canadian Hydrogen Observatory and Radio-transient Detector (CHORD). Canadian Long Range Plan for Astronomy and Astrophysics white papers.
The GLEAM 4-Jy (G4Jy) Sample: I. Definition and the catalogue. Publications of the Astronomical Society of Australia.
The Effects of RFI on 21-cm Measurements of the Epoch of Reionization.
Quantifying excess power from radio frequency interference in Epoch of Reionization measurements.

This material is based upon work supported by the National Science Foundation under Grant Nos. 1636646 and 1836019 and institutional support from the HERA collaboration partners. This research is funded by the Gordon and Betty Moore Foundation through grant GBMF5215 to the Massachusetts Institute of Technology. HERA is hosted by the South African Radio Astronomy Observatory, which is a facility of the National Research Foundation.

Data Availability: Data used in this paper is available upon request to the corresponding author. Software used in this paper can be found at https://github.com/HERA-Team/hera_qm.

Figure A1 provides a pseudocode flowchart for the iterative antenna flagging algorithm described in Section 2.3. Figure A2 provides a pseudocode flowchart for the auto-correlation-based flagging algorithm described in Section 3.