key: cord-0295370-cr5d4lpo
authors: Symul, L.; Holmes, S. P.
title: Labeling self-tracked menstrual health records with hidden semi-Markov models
date: 2021-01-13
journal: nan
DOI: 10.1101/2021.01.11.21249605
sha: 0ae3c3097302d54f495e947c746cfb34a961c225
doc_id: 295370
cord_uid: cr5d4lpo

Globally, millions of women track their menstrual cycle and fertility via smartphone-based health apps, generating multivariate time series with frequent missing data. To leverage data from self-tracking tools in epidemiological studies on fertility or the menstrual cycle's effects on diseases and symptoms, it is critical to have methods for identifying reproductive events, e.g. ovulation, pregnancy losses or births. We present two coupled hidden semi-Markov models that adapt to changes in tracking behavior, explicitly capture variable-- and state-- dependent missingness, allow for variables of different type, and quantify uncertainty. The accuracy on synthetic data reaches 98% with no missing data, 90% with realistic missingness, and 94% accuracy on our partially labeled real-world time series. Our method also accurately predicts cycle length by learning user characteristics. It is publicly available (HiddenSemiMarkov R package) and transferable to any health time series, including self-reported symptoms and occasional tests.

Given the generative nature of our model, the synthetic dataset is simulated from our coupled HSMMs. Synthetic time series 93 are generated with various amounts of missing data so that the e ect of tracking frequency on accuracy can be evaluated 94 (Methods). Table 1 : Tracking behavior of the users in our dataset. The rst row provides descriptive statistics of the overall tracking frequency, which is de ned as the ratio between the number of days in which the user tracked any feature in the app and the duration (in days) between their rst and last log. The next four rows provide feature-speci c tracking frequencies, i.e. the ratio between the number of days the user tracked this speci c feature and the duration of app use. The last two rows provides descriptive statistics about the duration of the longest continuous stretch of time users did not report any feature (missing) or reported as least one feature every day (tracking). 3 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted January 13, 2021. ; In the bleeding line (top), gray lines indicate 'no bleeding', orange/red lines mean that bleeding was reported. Darker reds indicate heavier bleeding ow. In the LH and pregnancy test lines (2nd and last lines), red lines indicate negative tests, blue lines positive ones. Temperature is depicted by a gradient ranging from blue for temperatures below the user's median value to red for temperatures above the median value. No mucus (3rd line) is depicted by gray lines, while fertile mucus is indicated in blue, sticky mucus in yellow and creamy one in beige.

Generative models of the female reproductive cycles and of users' tracking behavior 97 The self-reported time series result from two distinct generative processes: the rst one generates the values of the ob-98 served variables and the second one generates the censoring (the pattern of missing data). Here, we hypothesize that the 99 rst process only depends on the biological state, while the probability of missing data depends on both the biological state 100 of an individual and on the tracking goals of that individual. For example, the probability of a positive pregnancy test only 101 depends on whether the individual is pregnant (assuming no false positives), however, the probability of reporting a positive 102 pregnancy test is much higher in the rst month of pregnancy than in the last month of pregnancy. Similarly, a user who 103 wishes to achieve or avoid a pregnancy is likely to report their cervical mucus and/or temperature more frequently than a 104 user whose purpose is simply to keep track of their period. While, in principle, the tracking behavior does not a ect an indi-105 vidual's biology, our ability to detect speci c reproductive events depends on the tracking behavior. In general, the more a 106 user tracks, the more information we have about their biological state, and the better speci c events can be identi ed from 107 their data. Inversely, if, for example, a user only tracks their bleeding, it may be impossible to di erentiate early pregnancy 108 losses from long cycles. Speci ed model graph of the HSMM used to label users time series with a speci c tracking behavior. 'b' indicates tracking of bleeding only, 'bp' means tracking of bleeding and pregnancy tests, 'b test' tracking of bleeding and both ovulation and pregnancy tests, 'no mucus' indicates that the user is not tracking their mucus, in 'no temp' they are not tracking their temperature and in 'full', they are tracking all ve possible variables. This does not require them to track these features frequently, but at least occasionally within a few months. (c) Graph of the speci ed HSMM for modelling reproductive events. Arrows indicate possible transitions, their width is proportional to the transition probability. This graph should be read starting from the red circle ('M' for menses). From 'M', a rst loop matches the 6 states de ning ovulatory cycles ('lE', 'hE', 'pre-O' are follicular phase states, 'O' stands for ovulation, and 'post-O' and 'Lut' are luteal phase states). After ovulation, a pregnancy may start ('P) and end-up in a loss ('PL'-'L'-'lEpL' loop) or in a birth ('PB1'-'PB2'-'PB3'-'B'-'PP' (post-partum without breast-feeding) or 'BF' (breast-feeding) loops). Finally, two anovulatory states are de ned: 'AB' for anovulatory with bleeding and 'Ano' for anovulatory without bleeding. See Methods and Supplementary Material for state de nition and descriptions. (d) Prior and initial sojourn distributions for states which do not have a xed duration.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted January 13, 2021. ; https://doi.org/10.1101/2021.01.11.21249605 doi: medRxiv preprint

Our results on a synthetic dataset (Fig 3) show that our method is able to recover the ground truth with an accuracy of 98% 115 when variables are always reported (persistent tracking, no missing data). This provides an upper-bound on our ability to 116 decode real-world time series.

As expected, we observe a higher accuracy when variables are reported more frequently (less missing data, see tracking 118 categories on the x-axis of Fig. 3b-c) and when more variables are reported, e.g. tests results are reported in addition to 119 bleeding (see colored lines in Fig 3b and Methods for the de nition of tracking categories and the speci cation of missing 120 patterns). With time series mimicking the expected tracking behavior of a user whose purpose is to identify their fertile 121 window and pregnancies early on, the accuracy is 93%. The accuracy is of 85% when the tracking behavior is "occasional", 122 i.e. with an average tracking frequency of about 10% (Fig 3b and Supplementary Material).

The weighted accuracy, i.e. the accuracy weighted by the uncertainty on the labels at each time-point (see Methods), is 124 higher than the non-weighted accuracy (Fig 3b) , which indicates that, as desired, our method provides a higher uncertainty 125 on labels that are di erent from the ground-truth. Our method is able to warn against potential labeling mistakes.

States recovered with the highest accuracy are the menses and pregnancy states (Fig 3c, Supplementary Materials) . The 127 states su ering the most from a low tracking frequency are the states surrounding ovulation. Indeed, without a high tracking 128 frequency, it is impossible to pin-point the day of ovulation. A low accuracy is expected for these states when tracking 129 frequency is low. 6 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted January 13, 2021. ; https://doi.org/10.1101/2021.01.11.21249605 doi: medRxiv preprint

To quantify the performances on our real-world dataset, we used the interactive app embedded in our HiddenSemiMarkov 132 R package to de ne some ground-truth and label about 11% of our dataset. These labels were independently validated 133 by a fertility awareness methods expert (see Methods and Acknowledgments) and are visualized for the full dataset in the 134 Supplementary Material. Fig 4a provides an example of how a real-world user time series is labeled with our model.

Our results on the partially labeled Kindara dataset are similar to those obtained on simulated data. Menses and pregnancy 136 states are well recovered, while states surrounding ovulation are less accurately recovered (Fig 4b) . Here as well, the overall 137 weighted accuracy (94.5%) is higher than the non-weighted one (93%). This also holds true for the state-speci c accuracies 7 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted January 13, 2021. ; https://doi.org/10.1101/2021.01.11.21249605 doi: medRxiv preprint Unsupervised labeling of self-reported health records with biologically-relevant states is a challenging problem given the 149 high frequency of missing data, the changes in tracking behavior and the multivariate nature of the data. Here, we presented 150 a hierarchical generative method based on two hidden semi-Markov models. Our results on synthetic data and real-world 151 data, here self-reported fertility body-signs, show accurate recovery of the hidden states sequence. In addition to return-152 ing the most likely sequence of hidden states, this framework returns the likelihood at each time-point of this most likely 153 sequence. Our results show that the decoding accuracy is higher when the likelihood is high which implies that our model 154 is able to adequately quantify uncertainty. In contrast, most medical or psychological studies currently use methods which 155 are unable to quantify the uncertainty or the likelihood of their estimates, such as manual labeling or deterministic rules, 156 to identify the timing of reproductive events such as ovulation day. Our method both labels retrospective time series and 157 predicts cycle lengths of ongoing cycles. The method performs 2.88 times better than the baseline method for users with 158 irregular cycles.

Beyond modeling reproductive events, our adaption of hidden semi-Markov models allows (i) for missingness patterns that which can be used to manually label time series and/or con rm predicted labels. This interactive app can be used to create 173 some ground-truth for time series or to use an interactive boosting approach to accelerate the tting process.

This package, in combination with the proposed reproductive model presented here, provide ready-to-use o -the-shelf 175 tools that any scientist interested in studying health and biological variations associated with the menstrual cycle can use.

For example, the labeling method presented here can be used to label large retrospective dataset from menstrual cycle 177 tracking app and evaluate the changes in reported symptoms at speci c phases of the cycle before and after pregnancies. 178 And, while users in our dataset were naturally cycling, the proposed reproductive model could be extended to allow for 179 the detection of birth-control changes. Therefore, reported symptoms could be compared before and after birth control is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted January 13, 2021. ; https://doi.org/10.1101/2021.01.11.21249605 doi: medRxiv preprint cycle. Our framework could thus encourage medical researchers to partner with a tracking app or include a few questions 185 related to participants' fertility, such as contraceptive use, and their menstrual cycle, such as daily report of their bleeding.

This would ensure a more comprehensive understanding of the e ect of sex as a biological variable on the course of chronic 187 diseases, the e cacy of treatments or in epidemiological studies.

Beyond the menstrual cycle, the proposed model is ideal for decoding any self-reported time series such as physical activity 189 patterns, or time series of incomplete diagnosis data. As an example from the current pandemic, our hidden semi-Markov 190 model could be tted to datasets of covid-19 test results and reported symptoms to identify the di erent phases of infection 191 from "uninfected" to "recovered" over "incubating" and "infectious". Another application of our hidden semi-Markov model CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted January 13, 2021. ; https://doi.org/10.1101/2021.01.11.21249605 doi: medRxiv preprint Hidden semi-Markov models 199 First, we introduce the notation and de nitions in the case of a univariate non-censored time series, we then adapt these to 200 multivariate time series, to censored time series and to time series where some ground-truth is available.

Notation and model parameters 202 The random variable is measured at a sequence of time-points and may be a discrete, continuous or categorical variable.

It is either observed, taking a value , or missing. The hidden state is a discrete, non-observed, random variable whose 204 value is one of the states of the model. We will use the shorthand notation X for a sequence of observations of length 205 and de ne it as an ordered set of values of : X = ( 1 , 2 , … , ). A sequence of hidden state is written as s =

( 1 , 2 , … , ); or is the shorthand notation for = or = . is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted January 13, 2021. ; https://doi.org/10.1101/2021.01.11.21249605 doi: medRxiv preprint package, with a minor correction of the backtracking step for the Viterbi algorithm. 

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted January 13, 2021. ; https://doi.org/10.1101/2021.01.11.21249605 doi: medRxiv preprint Self-reported health records are subject to high level of missingness. Typically, the tracking behavior shows high variability 259 between users and within users. The tracking frequency of a user changes over time, both at short term, i.e. within the 260 menstrual cycle, but also at long term. For example, as users change their reproductive objectives, they may start or stop 261 tracking speci c body-signs. Missing observations may be modeled as a two-step process: rst, users must open the app on 262 a given day, and second, they must measure and report a speci c variable on that day. Both processes can be modeled as 12 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted January 13, 2021. ; https://doi.org/10.1101/2021.01.11.21249605 doi: medRxiv preprint where is the dirac function and is a parameter that can be speci ed and that represents the trust in the provided ground 287 truth. takes value between 0 (no trust) and 1 (full trust). In the M-step, the posterior state probabilities are adapted as We speci ed a 19-state model composed of 2 main loops (Fig. 2c) . The rst loop is a 7-state chain describing the succes- LH test or a rising temperature, would be reported. The second scenario corresponds to cycles in which a low temperature 303 is reported consistently between two bleeding episodes without abnormal bleeding being reported.

All state transitions are uni-directional except for the transition between the "high estrogen" state and the "low estrogen" 305 state. This transition is initialized with a low probability and allows the description of cycles typically experienced by users 306 su ering of poly-cystic ovary syndrome (PCOS). We expect this "hE-lE" transition probabilities to become very low when 307 tting our initial model on time series reported by healthy users, and to become higher when tting time series reported by 308 PCOS users or users that experienced delayed ovulation for other reasons. series at once, we use a hierarchical approach (see Fig. 2a ). First we categorize user's time series into subsequences with 316 distinct tracking behavior categories, then, if needed, we simplify our reproductive event model to match the information 317 available in each subsequence before predicting their hidden state sequence.

To categorize user's tracking behavior, we make two assumptions. First, we assume that users are consistent for several 319 months, i.e. same and , . Second, we assume that they are likely to change their tracking behavior around the time of 320 13 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted January 13, 2021. ; https://doi.org/10.1101/2021.01.11.21249605 doi: medRxiv preprint One-day state preceding ovulation to account for a higher probability of a positive LH test compared to previous days.

Ovu One-day state during which an egg is released from one of the ovaries.

6 postO Ovu + 2 Two-day state during which the mucus becomes less transparent and more sticky and during which the wake-up body temperature increases. 7 Lut Luteal Temperature is high (from progesterone), mucus is absent, sticky or creamy and spotting may be reported towards in the last days. 8 Ano Anovulatory cycle

The anovulatory cycle state follows the follicular phase if a low temperature is consistently observed until the next menses. 9 AB Anovulatory with bleeding

The anovulatory with bleeding state describes anovulatory cycles in which the users experience quasi-constant bleeding. It follows the menses and the temperature is low throughout this state. 10 P Implantation (Pregnancy)

The implantation state follows the post-ovulation state. Temperature is high and pregnancy tests may be positive 10-15 days after fertilization (ovulation).

After implantation, a pregnancy can end in a spontaneous or induced loss.

The embryo leaves the uterus, bleeding is usually reported. 13 lEpL Post-Loss

The post-loss state is similar to the early follicular phase except that pregnancy tests may still be positive (residual HCG in the urine). 14 PB1 Pregnancy with Birth -1st trimester

Pregnancies ending with a birth are divided into three trimesters. Temperature is very high in the 1st trimester. It has a xed duration of 84 days (12 weeks). 14 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted January 13, 2021. ; https://doi.org/10.1101/2021.01.11.21249605 doi: medRxiv preprint only a few days. Any switch between tracking behavior must be done via this transition state whose emission probabilities 331 are speci ed to coincide with menses as identi ed earlier (see Fig. 2b ). The tracking-behavior HSMM is speci ed to decode 332 time series that are composed of ve binary variables: the rst one re ects whether a period was detected by the simpli ed 333 reproductive model, while the other fours re ects the presence or absence of the four variables (all besides bleeding) that 334 may be reported by the app users.

Once subsequences with consistent tracking behavior have been identi ed, the reproductive HSMM is slightly modi ed to 336 match the information contained in each subsequence. These modi cation consists in one or a combination of three steps.

First, unobserved variables are removed from the model emission distributions. Second, if two states (or two loops) may 338 not be distinguishable from the available observations, we set to zero the transition probability for the least common state 339 or loop. For example, anovulatory cycles may not be distinguishable from ovulatory cycles without temperature records; 340 in the absence of temperature, the transition probability to the anovulatory state is set to zero. And third, we initialize to 341 a xed duration the sojourn distributions of states whose variation in duration can only be infered from the value of given The dataset was lightly pre-processed before being labeled using our hidden semi-Markov models we de ne in the next To test our framework and its ability to label real-world time series, we attempted to generate synthetic data that would . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted January 13, 2021. ; https://doi.org/10.1101/2021.01.11.21249605 doi: medRxiv preprint more frequently than others. To create these ve levels, we rst specify censoring probabilities for each state and variables 366 which would mimic the tracking behavior of a diligent user who wish to identify their fertile window and pregnancies early 367 on. We then increase or decrease these probabilities to mimic realistic levels of tracking assiduity (see Supplementary Ma-368 terial). The last step of the data generation process aims at re ecting that (a) users' tracking frequency may change over 369 time and (b) that not all users track the same set of variable. For example, some users only track their bleeding, while others 370 track their mucus and ovulation tests. We thus divided the users' time series into an arbitrary number of sub-sequences.

Each of these sub-sequences is randomly assigned to a set of tracked features. To simplify the interpretation of the results, 372 4 groups of realistic sets of tracked features are de ned: in the rst category, users only report their bleeding patterns; in 373 the second category, their only report bleeding patterns and pregnancy test results; in the third category, they report bleed-374 ing, mucus and temperature patterns and in the fourth category, they report all variables. We compute the performance 375 metrics de ned above as a function of the tracking frequency and the set of reported variables.

Labeling performance evaluation 377 We evaluate the performances of our framework by measuring its ability to recover simulated or manually labeled ground-378 truth. We compute the accuracy de ned as = 1 ∑ =1 (ŝ − s * ) and the weighted accuracy = ∑ =1 (ŝ −s * )

where the weights = ( ( = ) =ŝ |X; ) are the posterior state probabilities for the most likely state. We also 380 compute a state-speci c weighted accuracy which is de ned as = 1 ∑ =1 (ŝ − s * ) ( − s * ).

Predicting the next period date in ovulatory cycles.

In addition to labeling user's time series, our model can also be used to predict the date of the next period. While some 16 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted January 13, 2021. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted January 13, 2021. ; https://doi.org/10.1101/2021.01.11.21249605 doi: medRxiv preprint

Do sexually transmitted infections exacerbate negative premenstrual 407 symptoms? Insights from digital health

Menstrual cycle, sex hormones in female in ammatory bowel disease pa-410 tients with and without surgery

Mucus observations in the 412 fertile window: A better predictor of conception than timing of intercourse

Symptoms and Hormonal Changes Accompanying Ovulation

The Lancet

Real-world menstrual cycle 417 characteristics of more than 600,000 menstrual cycles

Stylized facts of nancial time series and hidden semi-Markov models

Hormonal Factors Involved in the Regulation of Basal Body Temperature During the 422

E ects of the menstrual cycle on bipolar disorder

Self-reported aring varies during the menstrual cycle in 426 systemic lupus erythematosus compared with rheumatoid arthritis and bromyalgia

The authors thank the fertility tracking app Kindara as well as their users. They also thank Valentina Salonna, fertility aware-400 ness counselor certi ed by the Symptotherm foundation (Switzerland), for validation of the manual labels on the Kindara