title: Configural Analysis of Oscillating Progression
authors: von Eye, Alexander; Wiedermann, Wolfgang; von Weber, Stefan
date: 2021-08-26
journal: J Pers Oriented Res
DOI: 10.17505/jpor.2021.23448

Oscillating series of scores can be approximated with locally optimized smoothing functions. In this article, we describe how such series can be approximated with locally estimated (loess) smoothing, and how Configural Frequency Analysis (CFA) can be used to evaluate and interpret the results. Loess functions are often hard to describe because they cannot be represented by just one function with interpretable parameters. We therefore suggest that specification of the CFA base model be based on the width of the window that is used for local curve optimization, on the weight given to data points in the neighborhood of the approximated one, and on the function that is used to locally approximate the observed data. CFA types indicate that more cases were found than expected from the local optimization model; CFA antitypes indicate that fewer cases were found. In a real-world data example, the development of Covid-19 diagnoses in France is analyzed for the beginning period of the pandemic.

In many contexts of real-world data analysis, series of observations do not follow a simple straight-line or smoothly curved progression. Instead, they show ups and downs and oscillations of various sorts. Examples of such progression include brain waves, stock market development, heartbeats, and traffic load on bridges. Another empirical example, which will be analyzed in detail later in this article, is the progression of positive Covid-19 diagnoses. Figure 1 displays this development over 83 days, beginning with the first diagnosis in France on January 24, 2020 (Santé Publique, 2020). Figure 1 shows that there clearly is a strong, accelerated upward trend for the first 36 days that is followed by a downward trend. In addition, this trend is not smooth, in particular after the number of positive diagnoses has reached its first peak, at day 36.

In this article, we (1) propose approximating this kind of series of observations with loess smoothing, and we (2) propose using Configural Frequency Analysis (CFA) to evaluate the smoothed curve and to identify segments in which an otherwise well-functioning selection of smoothing parameters fails to properly describe the series. The article is structured as follows. First, we provide an overview of loess smoothing. Second, we discuss evaluation and interpretation of loess-smoothed curves with CFA. Third, we analyze in detail the data presented in Figure 1.

Originally (Cleveland, 1979; Cleveland & Devlin, 1988; see also Wilkinson, Blank, & Gruber, 1996), the smoother to be discussed in this article was called lowess, short for locally weighted scatterplot smoother. The method is also known as the Savitzky-Golay filter (Savitzky & Golay, 1964). Now, the very same method is called locally weighted regression smoothing or locally weighted scatterplot smoothing, and is abbreviated as loess. To introduce this smoother, we follow Dagum and Luati (2003; see also Dagum & Luati, 2000, 2001) and first note that a series of scores can have a global and a local representation. When the global representation is used, one model is used for the entire series of scores.
For example, one straight regression line or polynomial is used to describe the entire series (for the use of polynomials in CFA, see, e.g., von Eye & Wiedermann, 2021). When local representation is used, the series of scores is subdivided into sectors (the windows), and regression lines, polynomials, or other functions are used to describe the scores in each sector. Typically, the same function is used for all sectors.

In standard applications of the General Linear Model (GLM; e.g., regression or ANOVA applications), weights can be used to reduce the error of a model. When weighted least squares is used to estimate GLM parameters, the parameter vector, β, is

β = (X'WX)⁻¹X'WY,

where X is the design matrix, Y denotes the observed scores, and W is the weight matrix. This equation holds for both global and local representations of series of scores. The design matrix contains the scores of the predictors and/or the coefficients of the polynomials used to model the series of scores. For example, a regression in which a quadratic curvature is fitted could be

y = b0 + b1x + b2x²,

where the x are the predictor scores and the b are the regression parameters. In the context of modeling series of scores, the x often are the observation points in time. When a global representation of a series is intended, this type of model is applied to the entire series of scores. When regression weights are used, these are often the scores of third variables, observed at the same points in time.

When, however, loess smoothing is used, two elements of this approach differ. First, the regression model is applied only to the scores that lie within a predefined window. The size of the window is determined a priori and can be measured in units of number of observation points. Second, the weights are also measured in units of observation points. More specifically, the observation points close to the data point to be approximated receive greater weights than the observation points farther away. In more technical terms, let the observation point to be modeled be tj, where j numbers the observation points, and let tk be the observation points that are time-adjacent and within the window. Then, the weight wk of tk is, for tj, a decreasing function of the distance dk = tk − tj. Evidently, the weight depends on the size of the window and, thus, on the distances between observation points within this window. Alternatively, and traditionally, the weight function for loess smoothing is the tri-cube weight function,

w(d) = (1 − |d|³)³ for 0 ≤ |d| ≤ 1, and w(d) = 0 otherwise.

In loess smoothing, each window is of equal width. That is, each window contains the same number of points. The smoothing parameter is the portion of observation points within a window. The wider the window, the smoother the fitted curve. When the absolute value of the distance tk − tj is scaled to range from 0 to 1, the tri-cube weight function assumes the form depicted in Figure 2.

Figure 2. Tri-cube weight function for 0 ≤ |d| ≤ 1.

Figure 2 indicates that the weight decreases as the distance increases. Within each window, the same polynomial regression is performed. The degree of the polynomial is determined a priori and is bounded by the width of the window. To prevent hitting the points exactly (which would counter the idea of smoothing), the polynomial should be of degree at most 'number of observation points within the window minus 2.' Dagum and Luati (2003) state that low-degree polynomials, e.g., linear or quadratic polynomials, are usually the best choice.
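To make the mechanics of one local fit concrete, the following is a minimal sketch in Python: tri-cube weights are computed for the points in one window, and the weighted regression equation given above is solved for a quadratic polynomial. The function names (tricube, local_fit), the scaling of the distances, and the toy data are illustrative choices, not taken from the article.

```python
# A sketch of one local, weighted quadratic fit (one loess window).
# tricube() and local_fit() are illustrative names; the toy series is made up.
import numpy as np

def tricube(d):
    """Tri-cube weights for distances d scaled to the interval [0, 1]."""
    d = np.abs(d)
    w = (1.0 - d**3)**3
    w[d >= 1.0] = 0.0          # points outside the window receive zero weight
    return w

def local_fit(t, y, j, half_width, degree=2):
    """Fit a weighted polynomial around target point t[j]; return the smoothed value."""
    k = np.arange(max(0, j - half_width), min(len(t), j + half_width + 1))
    d = (t[k] - t[j]) / (half_width + 1.0)        # scaled distances, strictly inside (-1, 1)
    w = tricube(d)
    X = np.vander(t[k] - t[j], degree + 1)        # columns: x^2, x, 1 for a quadratic
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[k])   # beta = (X'WX)^-1 X'WY
    return beta[-1]                               # value of the local polynomial at t[j]

# toy example: a noisy oscillating series of 30 equidistant observations
rng = np.random.default_rng(1)
t = np.arange(30, dtype=float)
y = 50.0 + 10.0 * np.sin(t / 3.0) + rng.normal(0.0, 2.0, size=30)
print(local_fit(t, y, j=10, half_width=3))        # smoothed value for observation point 10
```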
Higher-degree polynomials are suitable in particular when the smoothing is expected to reflect at least some of the observed oscillations, and when the window includes many data points. The width of the window is also determined a priori. Selecting narrow windows results in a curve that oscillates with the observed scores. Selecting a wide window results in a very smooth curve with no oscillations. Given the width of the window, the weights, the degree of the polynomial, and the magnitude of the smoothing parameter, the weighted least squares regression parameter vector for window j is estimated as

βj = (Xj'WjXj)⁻¹Xj'WjYj.

That is, one weighted polynomial regression is estimated for the series of scores in each window.

Configural Frequency Analysis (CFA; Lienert, 1968; for technical elements and detail on CFA and recent developments, see von Eye et al., 2010, von Eye & Wiedermann, 2021, and Wiedermann et al., 2021) allows one to identify strong deviations from model-based expected frequencies. When, locally, significantly more cases are observed than expected based on the model, a CFA type has been identified. When significantly fewer cases are observed, a CFA antitype has been identified. With only very few exceptions, CFA is applied to the analysis of cross-classifications of two or more variables. The exceptions are all analyses of series of scores. The approach proposed in this article belongs to this group.

CFA of individual series of scores was proposed by von Eye (2002) as unidimensional CFA. In this approach, frequencies for a series of scores are estimated either to be constant, as in zero-order CFA, or with weights that account for a priori expected variations in a series. The difference from the approach proposed here is that, in unidimensional CFA, just as in all other models of longitudinal CFA, the expected cell frequencies are estimated based on a global representation of the series of scores. Here, local representation is used. Specifically, loess smoothing is used to estimate the expected frequencies for each observation. As was explained in the previous section on loess smoothing, the frequencies are estimated with the same model for each time window. This model uses specifications made before the analysis, that is, width of window, smoothing parameter, and degree of polynomial, and it remains the same over all time windows.

This approach is novel in the following two respects. First, instead of using the entire pool of data points, the estimation of expected frequencies uses only a selection of data points. Therefore, CFA types and antitypes indicate not only local deviations from expectancy; they are also local in the sense that the estimation of expected cell frequencies uses only local information. The range of information that is used depends on the width of the time window that is used for loess modeling. Second, types and antitypes do not indicate local contradictions to a global model. Instead, they indicate that a smoothing model whose specifications were made a priori, without using information from the data to be analyzed, fails to describe the series within a particular time frame. We call the approach proposed here loess CFA.
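As a sketch of how loess-smoothed values can serve as expected frequencies in loess CFA, the following uses the lowess smoother from statsmodels as a stand-in. Two assumptions deserve mention: statsmodels fits locally linear (degree-1) rather than quadratic polynomials, and the counts below are simulated placeholders rather than real data.

```python
# A sketch of loess-based expected frequencies for loess CFA.
# statsmodels' lowess (locally linear, tri-cube weights) stands in for the
# quadratic loess described in the text; the counts are simulated placeholders.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(2)
days = np.arange(1, 84, dtype=float)               # 83 equidistant observation days
observed = rng.poisson(lam=200, size=83).astype(float)

# frac is the smoothing parameter: the portion of observation points per window
expected = lowess(observed, days, frac=0.05, it=0, return_sorted=False)

# expected cell frequencies must be positive for CFA; guard against negative estimates
expected = np.clip(expected, 1e-8, None)
print(expected[:5])
```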
It has the following characteristics:
• it is nonparametric: no assumptions are made concerning the distribution of the data when the loess smoother is applied;
• it is a local regression approach: standard regression methods with least squares estimation are used for each time window (a number of alternative regression models can be considered as well);
• it, therefore, shares the characteristics of least squares regression, such as sensitivity to outliers (note that robust loess smoothing approaches have been proposed);
• it is easily implemented, in particular when the observation points are equidistant (approaches to loess smoothing for non-equidistant observation points do exist, but they are computationally more intensive);
• it is more robust than most other time series smoothers;
• it requires relatively long series of observations (although there are no solid recommendations in the literature concerning the number of observations needed);
• the same significance tests for detecting types and antitypes can be used as in standard CFA (see von Eye & Wiedermann, 2021); and
• the same methods of α protection can be used as well; the number of tests is given by the number of observation points.

Data example. In the following paragraphs, we resume the analysis of the series of positive Covid-19 diagnoses that was used for Figure 1. We proceed as follows. First, we illustrate that a global representation of these data is not satisfactory. Second, we illustrate the smoothing effects of time windows that differ in width. Third, we select a loess smoothing approach and subject the resulting expected frequencies to loess CFA.

To illustrate that global representation can be less than satisfactory, we use just a quadratic function. Figure 3 shows the result. Figure 3 suggests that the global representation of the series of scores by a quadratic curve describes the data poorly. The overall up-and-down is represented in part; the oscillations are not represented at all. This visual impression is supported by the non-significant correlation between observed and expected frequencies, r = -0.147 (p = 0.184). Adding a linear trend to this model increases the correlation to r = 0.777 (p < 0.01), but the data are still not well represented, in the sense that there are large differences between expected and observed frequencies. We, therefore, move to loess smoothing.

Here, we first illustrate that, whereas a wider window frame results in a smooth approximation curve, a narrower frame allows one to better capture the oscillations, but at the expense of a less smooth curve. Figure 4 displays the loess-smoothed curve under the following specifications:
• each time window contains a portion of 0.4 of all data points, thus allowing only long waves to be represented;
• data points to the right and to the left are weighted inversely proportional to their distance from the target point by way of the tri-cube weight function; and
• quadratic local regression is used in each window.

Figure 4. Number of positive Covid-19 diagnoses in France in the 83 days after the first positive diagnosis; loess smoother.

Figure 4 shows that the approximation of the observed frequencies with the loess smoother is clearly better than the approximation that is based on the global regression function. The correlation between the observed and the expected frequencies now is r = 0.92, that is, close to perfect.
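To make the effect of the window width concrete, the following minimal sketch compares a wide and a narrow window, again using statsmodels' locally linear lowess as a stand-in for the quadratic loess fits described above; frac = 0.4 corresponds to the wide window of Figure 4, and frac = 0.05 anticipates the narrower window selected below. The counts are simulated placeholders, not the French data.

```python
# A sketch comparing a wide (frac = 0.4) and a narrow (frac = 0.05) loess window.
# Simulated placeholder counts; statsmodels' locally linear lowess as a stand-in.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(3)
days = np.arange(1, 84, dtype=float)
observed = rng.poisson(lam=200, size=83).astype(float)

wide = lowess(observed, days, frac=0.4, it=0, return_sorted=False)     # only long waves
narrow = lowess(observed, days, frac=0.05, it=0, return_sorted=False)  # captures oscillations

print("r, wide window:  ", np.corrcoef(observed, wide)[0, 1])
print("r, narrow window:", np.corrcoef(observed, narrow)[0, 1])
```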
Nevertheless, this result is not acceptable either, for two reasons. First, it estimates negative frequencies for the first five days. In the analysis of frequency data, negative estimates must be avoided. Second, it smooths over the oscillations. It was one of the goals of the present analyses to capture at least some of the variability in the oscillations. For these two reasons, we now select a narrower time window. Specifically, we reduce the portion of data points in the window to 0.05, thus allowing short waves, with periods of just a few days, to be represented. All other model specifications remain the same. Figure 5 displays the smoothed curve.

Evidently, the loess-smoothed estimates are now much closer to the observed frequencies than in Figure 4. The correlation between the estimated and the observed frequencies now is r = 0.983, very close to perfect. Still, only 12 observed scores are reproduced exactly. This is desirable, because a smoothed curve that reproduced all data points exactly would be no more parsimonious than the observed curve itself. Now, in spite of the extremely large correlation, Figure 5 shows that there are differences between the observed and the estimated progression of positive diagnoses. This is where CFA comes into play. CFA allows one to answer the questions of where in the series deviations are significant and how to interpret them. We take the four steps of CFA (von Eye & Wiedermann, 2021).

Step 1: Specification of the base model. The key element of the interpretation of CFA types and antitypes is the CFA base model. In all applications of CFA thus far, the base model was defined as a probability model that specifies variable relations (cf. von Eye, 2004; von Eye & Gutiérrez Peña, 2004). In the present article, a new approach is proposed. Specifically, the base model is no longer expressed in terms of variable relations. Instead, the model is defined by the characteristics of the data generation process that is hypothesized to smooth the series. This process has, in the present case, three main characteristics. First, it is unchanged over the entire time series. Second, simple quadratic polynomials are sufficient to describe the local shape of the progression. Third, the time window used for the present analysis is relatively narrow. This last characteristic was specified so that the oscillating characteristics of the series could be captured. This was deemed necessary because there are systematic changes over the days of the week. These changes reflect characteristics of data collection; that is, on Sundays, fewer tests and diagnoses are performed than on other days of the week. This results in day-of-the-week-related oscillations. Figure 5 suggests that the selected model is capable of approximating these oscillations, up to a point.

Unchanged from standard CFA, the interpretation of CFA types and antitypes is conducted with reference to the CFA base model. Here, types and antitypes do not suggest that variable relations exist that were not included in the base model. Instead, types and antitypes suggest that the model that was used to represent the local progression is, for a particular target point and its neighborhood, insufficient. In most cases, one can assume that the model is not complex enough. Here, higher-order polynomials can be considered. Substantively, one can ask what events caused the locally increased complexity of the progression.

Step 2: Significance testing. The numbers of positive Covid-19 diagnoses are large. Therefore, we can use the z-test and protect α with the Bonferroni procedure. The protected significance threshold becomes α* = 0.05/83 ≈ 0.0006. This value corresponds to an absolute z-score of 3.238. Observed z-scores more extreme than this value point at target points that constitute types or antitypes.
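A minimal sketch of this testing step follows, assuming the standard CFA z-test z = (observed − expected)/√expected and the one-sided critical value quoted above; the two observed/expected pairs are the most extreme configurations reported in Table 1 (Days 43 and 35), and the variable names are illustrative.

```python
# A sketch of the Bonferroni-protected z-test used to flag types and antitypes.
# Assumes the CFA z-test z = (o - e) / sqrt(e); the two example pairs are the
# most extreme configurations reported in Table 1 (Day 43 and Day 35).
import numpy as np
from scipy.stats import norm

observed = np.array([3800.0, 4181.0])      # Day 43, Day 35
expected = np.array([2936.5, 5188.0])      # loess-based expected frequencies

alpha_star = 0.05 / 83                     # Bonferroni protection over 83 tests
z_crit = norm.isf(alpha_star)              # about 3.24 for alpha* = 0.0006

z = (observed - expected) / np.sqrt(expected)
print("z:", z.round(2), "critical value:", round(z_crit, 3))
print("types:    ", z > z_crit)            # significantly more cases than expected
print("antitypes:", z < -z_crit)           # significantly fewer cases than expected
```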
Step 3: Performing CFA. Table 1 displays the results of CFA. The first column of the table contains the days of observation, beginning with the first day of a positive diagnosis. The second column contains the observed counts. The third column contains the frequencies that were estimated under the wide window (Figure 4; Loess04). The fourth column contains the frequencies that were estimated under the narrower window (Figure 5; Loess005). The values in these two columns correlate at r = 0.929. Again, this is an extremely high correlation. Still, the narrower window results in a much better approximation. In particular, Table 1 shows that all estimates under the narrower window (Loess005) are positive, a must for frequency data.

Step 4: Interpretation of types and antitypes. Table 1 shows that nine antitypes and 13 types emerged. Of these, we first interpret the most extreme ones. The most extreme type is constituted by Day 43: 3800 positive diagnoses were counted, but the loess smoother led one to expect only 2936.5 cases. The hypothesized data generation mechanism was clearly unable to capture this strong upswing. The most extreme antitype is constituted by Day 35: 4181 positive diagnoses were counted for this day, but 5188 had been expected. Here, the hypothesized data generating mechanism was unable to capture the downswing in the progression. This applies accordingly to the other types and antitypes.

Second, we interpret the resulting pattern of types and antitypes. Interestingly, seven of the nine antitypes are constituted by days that are multiples of seven. In addition, these seven antitypes all occur on Thursdays. This is significantly more than expected under the assumption of a uniform distribution of antitypes over the days of a week (p = 0.017). Considering that Santé Publique did not publish any discussion of diagnosis patterns or trends, we are in search of an explanation of this result. Among the possible interpretations is the following one. At the beginning of the pandemic, it took about four days before the results of the Covid-19 tests were available. The Thursday antitypes may, therefore, reflect the fact that, on the Sundays before, fewer tests were administered than on other days. The CFA base model of loess smoothing seems to be sensitive to these drops. Types tend to occur more often than average on Fridays: five of the 13 types occurred on a Friday. This, however, is a non-significant deviation from expectancy. Longer series of data may be needed to determine whether there is a connection between these increases and the drops on Thursdays.

In all, we conclude that the rhythm that was hypothesized for the days of the week was much better captured under the narrower than under the wider window, but still not well enough to prevent these antitypes and types from emerging. There seems to be a significant pattern only for the antitypes. In this respect, the downswings resulted in a pattern that is harder to capture by the hypothesized data generation mechanism than the pattern of the upswings. Overall, however, more upswings than downswings were missed. It needs to be determined whether this reflects the progression of the pandemic, day-of-the-week-related characteristics of testing, or both.
In this article, we propose a new approach to CFA. The new approach differs from existing approaches to CFA in two major respects. First, thus far in CFA, the logic of searching for types and antitypes consisted of specifying a base model that included all effects, e.g., variable relations, that are not of interest to the researchers. Types and antitypes then suggest where in the data space the effects exist that are of interest to the researchers. The new approach follows this logic. However, instead of specifying a base model of variable relations, it specifies a base model in terms of a data generating process. Types and antitypes then suggest the sectors in the data space in which a different or more complex data generation process must be at work. The second major difference between CFA as it was known so far and the new approach is that the new approach is tailored to be applied to series of individual observations, e.g., time series (standard CFA approaches for longitudinal data rely on aggregated counts). To approximate a series, a data generating process is hypothesized that explains the progression locally, that is, within intervals of an a priori specified width.

Just as in earlier approaches to CFA, the number of options to explain the emergence of types and antitypes is limited. In CFA as it was known so far, these options consist of variable relations. For example, in first-order CFA, these options include all possible variable interactions. In Prediction CFA (P-CFA), these options include all possible predictor-criterion relations. Here, the options concern the three main elements of loess smoothing: the width of the window, the function used to locally approximate the progression, and the weights used for the neighbors of the target point. In the example in the last section, we moved from Figure 4 to Figure 5 by leaving the function and the weights untouched but narrowing the time window. This way, we opened the door to a finer-grained resolution of the smoothing process, and we know that the differences between Figures 4 and 5 are solely due to the narrowed window.

In the analysis of time series, a large number of methods has been proposed. A thorough comparison of these methods is beyond the scope of this article. However, it can be said that some of these, e.g., spline smoothing, can yield results very similar to those of loess smoothing. In addition, spline smoothing requires less computational effort than loess smoothing. For the current purposes, loess smoothing nevertheless seems more useful, for two reasons. First, loess smoothing can be tailored so that fast and slow changes are modeled separately. Second, it is more local, in the sense that changes within individual, narrow windows can be reflected. This characteristic is of particular importance for CFA. Still, the configural analysis of time series that is based on curves that describe the data generation process in just one function can be an interesting option, one that will be explicated in future work.

Generalizations of the proposed new CFA approach are straightforward. The most important could be the generalization to loess smoothing of more than one variable. This way, in time series, progression hyperplanes would be smoothed instead of progression lines. This is promising, but it is material for future work. Other generalizations concern the observation points. In the description of the method and in the example given in this article, the observation points were equidistant.
In many empirical studies, equidistant observations are hard to realize. For example, in school settings, series of observations are interrupted because children are on vacation. In psychotherapy, time intervals between therapy sessions can vary depending on the needs of patients, or WiFi reception varies because of interference. Methods for loess smoothing of non-equidistant observation points exist and can be incorporated into the loess CFA framework. A third generalization of the method proposed here consists of incorporating the new method into the context of existing CFA methods. It is, for example, conceivable to incorporate loess CFA into moderator CFA. Moderator variables would then be used to explain variations of type/antitype patterns across individuals or groups of individuals. In a similar way, smoothed curves could be generated and subjected to CFA for periods before and after certain interventions. This, again, is material for future work.

The three authors contributed equally to this article, and all consent to the publication of this work. The authors declare that there are no conflicts of interest. Lars-Gunnar Lundh served as action editor for this article. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots.
Cleveland, W. S., & Devlin, S. J. (1988). Locally-weighted regression: An approach to regression analysis by local fitting.
Dagum, E. B., & Luati, A. (2000). Predictive performances of some linear and nonlinear smoothers for noisy data. Statistica, LX.
Dagum, E. B., & Luati, A. (2001). A study of the asymmetric and symmetric weights of kernel smoothers and their spectral properties.
Dagum, E. B., & Luati, A. (2003). Global and local statistical properties of fixed-length nonparametric smoothers.
Lienert, G. A. (1968). Die "Konfigurationsfrequenzanalyse" als Klassifikationsmethode in der klinischen Psychologie. Paper presented at the 26. Kongress der Deutschen Gesellschaft für Psychologie.
Santé Publique (2020). COVID_19: point épidémiologique du 10 mars 2020.
Savitzky, A., & Golay, M. J. E. (1964). Smoothing and differentiation of data by simplified least squares procedures.
von Eye, A. (2004). Base models for Configural Frequency Analysis.
von Eye, A., & Gutiérrez Peña, E. (2004). Configural Frequency Analysis: the search for extreme cells.
von Eye, A., Mair, P., & Mun, E.-Y. (2010). Advances in Configural Frequency Analysis.
Configural frequency trees. Development and Psychopathology.
Wilkinson, L., Blank, G., & Gruber, C. (1996). Desktop data analysis with SYSTAT.